-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: add multi-field assignment helpers in np.lib.recfunctions #11526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
If we are going to support both, would it not be more compatible to add new functions for views and keep the old behavior by default? |
numpy/lib/recfunctions.py
Outdated
n_elem = sum(f[2] for f in fields) | ||
|
||
if dtype is None: | ||
out_dtype = np.result_type(*[f[1].base for f in fields]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you pick between 'result_type' and not 'common_type'? It would be nice to justify the choice between them, even if the rationale is arbitrary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
common_type
will convert a dtype with all-integer fields to float, which doesn't seem right. result_type
preserves the int type in that case.
numpy/lib/recfunctions.py
Outdated
fields.extend(_get_fields_and_offsets(field[0], field[1] + offset)) | ||
return fields | ||
|
||
def structured_to_unstructured(arr, dtype=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer an implementation that just does arr.astype(replace_all_fields(arr.dtype, dtype)).view(dtype)
, since then it could take copy
and casting
parameters to forward to astype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, done.
As a bonus, the function now returns a view instead of a copy if all the fields have the same dtype, which is the use-case people said they used on the mailing list.
@charris, that's still a possibility. I definitely don't mean to push these structured array changes without some support, or at least passive agreement, from the numpy developer community, especially if there is a backcompat break. #6053 actually makes two, somewhat independent, changes: 1. It makes multifield indexing return a view, and 2. it makes multifield assignment work "by position". In this PR the methods We can definitely go back one either or both of those. Let me try exploring that in another PR before finishing this one. |
Just to be clear, numpy 1.15 will still be back compatible? Or, put differently, if we broke backward compatibility, when did we do it? |
Remember there are two behavior changes: For the "multifield indexing returns a view" behavior, we put it off until 1.16, so nothing is broken yet. For the "copy-by-position" behavior, we already broke back-compatibility in 1.14. |
@ahaldane Does this still need consideration for 1.15.x? |
@ahaldane What is the status of all this stuff? |
Sorry, I've been out/busy! I'm hoping to be more active soon... I think the status is that there is still some hesitation about the behavior changes I wanted for structured arrays. There are two alternative proposals:
Here is the central reason for hesitation: Both option 1 and 2 break some compatibility with 1.14, because 1.14 was inconsistent with itself so any fix would have to break something. However, it may be the case that we break fewer people's code with option 2. On the other hand, I find option 1 much more elegant and gives more reasonable future behavior. |
Is this ready to hit the mailing list for discussion? |
Adds helper functions for the copy->view transition for multi-field indexes. Adds `structured_to_unstructured`, `apply_along_fields`, `assign_fields_by_name`, `require_fields`.
4297d69
to
a868e92
Compare
Fixed up based on latest comments. I think some discussion of the situation is still necessary, and it is on the agenda for the next status meeting. In preparation I've written up a draft of the structured array change proposal at https://gist.github.com/ahaldane/6cd44886efb449f9c8d5ea012747323b Looking over it again now, my feeling is we should go ahead with this PR an all of the proposed changes, despite some back-compat breaks. |
The mailing list / design document met with little dissent. ,Can this go in? |
Cannot hurt. The only difficulty I see is that folks using it may need to make their code NumPy version dependent. Needs a release note. |
@ahaldane Needs a release note. |
will do |
a868e92
to
09188b4
Compare
09188b4
to
c892733
Compare
1564ee8
to
61371de
Compare
All, right, I updated the release note, and also added the new functions to the If we merge this, I can submit another PR which implements the multifield copy->view change, since now that #8100 is fixed and with these methods implemented, there should be fewer obstacles to making the change. We can discuss it there. |
@shoyer Could you check the dispatchers? |
Dispatchers look good here |
OK, let's get this in and see how things go, |
Apply function 'func' as a reduction across fields of a structured array. | ||
|
||
This is similar to `apply_along_axis`, but treats the fields of a | ||
structured array as an extra axis. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this needs a warning that fields are all cast to the same type
""" | ||
|
||
if dst.dtype.names is None: | ||
dst[:] = src |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To work on 0d arrays, this needs to be ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
return (array,) | ||
|
||
@array_function_dispatch(_require_fields_dispatcher) | ||
def require_fields(array, required_dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name strikes me as a little odd, but I also can't think of a better one.
It might be handy to use the word "require" in the description somewhere, to make the name easier to remember.
@ahaldane Want to take care of @eric-wieser comments? Otherwise I'll do it. |
coming up... |
the field datatypes. | ||
|
||
Nested fields, as well as each element of any subarray fields, all count | ||
as a single field-elements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if dtype
is itself a structured array? Eg, consider:
point = np.dtype([('x'. int), ('y', int)])
triangle = np.dtype([('p_a', point), ('p_b', point), ('p_c', point)]
I'd expect to be able to do
arr = np.zeros(10, triangle)
structured_to_unstructured(arr, dtype=point)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is accounted for, however your particular example shows there is a bug in this code because it can't account for repeated field names in the nested structures. Will fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second examination, I also missed that your output was structured. structured_to_unstructured
doesn't account for that the way you expected, and I'm not sure there is a good "rule" for how it should work. Your example makes sense because each field can be unambiguously cast to the new structured type, so you expect the output shape to be (10, 3) with point
dtype. My implementation currently casts all nested fields to the new dtype, resulting in a (10, 6) array of points: It casts the x and y of each point to a point individually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generating "field_{}".format(i)
as the name for each field is probably the safest bet - if you control all the names, you don't need to worry about escape sequences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that's already fixed/implemented in #12446.
I'd rather not attempt to account for structured dtypes in the output though: That's not a pre-existing use-case we're trying to fix, and the best behavior to implement is unclear to me at the moment. Any users who previously did something like that can still do it using repack_fields
instead of structured_to_unstructured
, though without the added safety the latter has added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not a pre-existing use-case we're trying to fix,
As I understand it, the purpose of structured_to_unstructured
is to replace the arr[fields].view(dt)
idiom. But it sounds like it doesn't work as a universal replacement:
arr[['f_a', 'f_b']].view(float)
→structured_to_unstructured(arr['f_a', 'f_c'], float)
arr[['p_a', 'p_b']].view(point)
→ ???
Is there a way to spell the second case with these functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
though without the added safety the latter has added.
Can you give an example of that safety, maybe even in the docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's my understanding:
All code of the form arr[['field1', 'field2, ...]].view(dt)
in 1.15 can be replaced by repack_fields(arr[['field1', 'field2, ...]]).view(dt)
in 1.16, without exceptions. It should be identical performance, since in both cases a copy is made.
Additionally, we have implemented a new function structured_to_unstructured
. Although this can't be used as a replacement in all cases as you point out, in the many cases where it can be used it is better because it avoids a copy for multifield-indexes, it is safer since it is memory-layout-agnostic, and it better documents the user's intent. It is safer because it saves the user from bugs when doing the view
: If the user tries to do the view themselves it is quite easy to forget to account for padding bytes, endianness, dtype, or misunderstand the memory layout, but structured_to_unstructured
is written in a way to guarantee the view is "safe" for any memory layout and dtype (the fields are always viewed with the right offsets). structured_to_unstructured
also casts the fields if needed, which the view above does not do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a description based on my last comment in #12447
return (arr,) | ||
|
||
@array_function_dispatch(_structured_to_unstructured_dispatcher) | ||
def structured_to_unstructured(arr, dtype=None, copy=False, casting='unsafe'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are places we learnt that unsafe
was a bad default, but ended up stuck with it, leaving users surprised by the conversion.
Should we apply that learning here, and pick a more conservative default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have missed that. Maybe in #8733? But that was about assignment using unsafe
casting, with no option to specify otherwise, unlike here where there is a keyword the user can specify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that issue was one of the ones I was thinking of - thanks for linking it, I was looking for that for unrelated reasons too!
Unsafe just feels like an... unsafe default to me. In my opinion, unsafe behavior should be something you ask for, not something you get by default. You're picking between:
- as is:
f(...)
vsf(..., casting='safe')
- proposed:
f(..., casting='unsafe')
vsf(...)
I'd much rather see the word 'unsafe' to tell me I need to think more carefully about that line of code, rather than having to look for the absence of it.
I don't have a good memory of how the other casting modes behave. I'd be inclined to pick same_kind
to match the default value of the casting argument for for ufunc
s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an argument for unsafe
:
First, it matches the default for the same keyword in astype
, and so its easier for the user to remember if they are used to using astype
.
Second, it seems like most of the time the user wants unsafe
, because there are many common casts that are ruled out otherwise. For instance casts from f8
to i8
are disallowed with same_kind
, but I expect this is a very common cast.
Actually, for reasons I don't understand, ufuncs seem to allow casts from f8
to i8
even though they supposedly use same_kind
:
>>> np.arange(3, dtype='f8').astype('i8', casting='same_kind')
TypeError: Cannot cast array from dtype('float64') to dtype('int64') according to the rule 'same_kind'
>>> np.add(np.arange(3, dtype='f8'), np.arange(3, dtype='i8'))
array([0., 2., 4.])
>>> np.can_cast('f8', 'i8', casting='same_kind')
False
So ufuncs appear to use unsafe
casting despite the keyword default??
Adds helper functions for the copy->view transition for multi-field indexes. Adds
structured_to_unstructured
,apply_along_fields
,assign_fields_by_name
,require_fields
.This is based on the feedback from discussions on the mailing list and on github about cases where users wanted the "old" behavior. See #6053 (comment), and this mailing list thread: http://numpy-discussion.10968.n7.nabble.com/Setting-custom-dtypes-and-1-14-tt45156.html#a45207. Note the reply there arguing we should skip
apply_along_fields
, maybe we should remove that one.Don't merge yet. I'd like to let this sit a while in case there are more ideas for useful functions. Putting it up now so everyone can get a sense of what is being proposed.