Skip to content

ENH: add multi-field assignment helpers in np.lib.recfunctions #11526

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 24, 2018

Conversation

ahaldane
Copy link
Member

@ahaldane ahaldane commented Jul 7, 2018

Adds helper functions for the copy->view transition for multi-field indexes. Adds structured_to_unstructured, apply_along_fields, assign_fields_by_name, require_fields.

This is based on the feedback from discussions on the mailing list and on github about cases where users wanted the "old" behavior. See #6053 (comment), and this mailing list thread: http://numpy-discussion.10968.n7.nabble.com/Setting-custom-dtypes-and-1-14-tt45156.html#a45207. Note the reply there arguing we should skip apply_along_fields, maybe we should remove that one.

Don't merge yet. I'd like to let this sit a while in case there are more ideas for useful functions. Putting it up now so everyone can get a sense of what is being proposed.

@charris
Copy link
Member

charris commented Jul 7, 2018

If we are going to support both, would it not be more compatible to add new functions for views and keep the old behavior by default?

n_elem = sum(f[2] for f in fields)

if dtype is None:
out_dtype = np.result_type(*[f[1].base for f in fields])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you pick between 'result_type' and not 'common_type'? It would be nice to justify the choice between them, even if the rationale is arbitrary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

common_type will convert a dtype with all-integer fields to float, which doesn't seem right. result_type preserves the int type in that case.

fields.extend(_get_fields_and_offsets(field[0], field[1] + offset))
return fields

def structured_to_unstructured(arr, dtype=None):
Copy link
Member

@eric-wieser eric-wieser Jul 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer an implementation that just does arr.astype(replace_all_fields(arr.dtype, dtype)).view(dtype), since then it could take copy and casting parameters to forward to astype

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done.

As a bonus, the function now returns a view instead of a copy if all the fields have the same dtype, which is the use-case people said they used on the mailing list.

@ahaldane
Copy link
Member Author

ahaldane commented Jul 8, 2018

@charris, that's still a possibility. I definitely don't mean to push these structured array changes without some support, or at least passive agreement, from the numpy developer community, especially if there is a backcompat break.

#6053 actually makes two, somewhat independent, changes: 1. It makes multifield indexing return a view, and 2. it makes multifield assignment work "by position". In this PR the methods repack_fields and (un)structured_to(un)structured are to help transition for 1, and assign_fields_by_name and require_fields are for 2. I think these changes help fix a lot of bugs, inconsistencies, and confusing behavior.

We can definitely go back one either or both of those. Let me try exploring that in another PR before finishing this one.

@charris
Copy link
Member

charris commented Jul 8, 2018

Just to be clear, numpy 1.15 will still be back compatible? Or, put differently, if we broke backward compatibility, when did we do it?

@ahaldane
Copy link
Member Author

ahaldane commented Jul 8, 2018

Remember there are two behavior changes:

For the "multifield indexing returns a view" behavior, we put it off until 1.16, so nothing is broken yet.

For the "copy-by-position" behavior, we already broke back-compatibility in 1.14.

@charris
Copy link
Member

charris commented Jul 16, 2018

@ahaldane Does this still need consideration for 1.15.x?

@charris
Copy link
Member

charris commented Aug 17, 2018

@ahaldane What is the status of all this stuff?

@ahaldane
Copy link
Member Author

ahaldane commented Aug 17, 2018

Sorry, I've been out/busy! I'm hoping to be more active soon...

I think the status is that there is still some hesitation about the behavior changes I wanted for structured arrays. There are two alternative proposals:

  1. Keep all the new numpy 1.15 behavior (already released), which makes all struct assignment "assign by position", and add these helper functions to help users like in MAINT: struct assignment "by field position", multi-field indices return views #6053 (comment), whose code I broke.
  2. Choose the alternate option, of using "assign by name" for all struct assignment. (Implemented in MAINT: cancel multifield copy->view and copy-by-position plans (alternate proposal) #11530) This is in response to MAINT: struct assignment "by field position", multi-field indices return views #6053 (comment) and BUG: Array copy does not zero out gaps in structured dtypes #10789, where a user has code which I broke.

Here is the central reason for hesitation: Both option 1 and 2 break some compatibility with 1.14, because 1.14 was inconsistent with itself so any fix would have to break something. However, it may be the case that we break fewer people's code with option 2. On the other hand, I find option 1 much more elegant and gives more reasonable future behavior.

@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Aug 17, 2018
@mattip
Copy link
Member

mattip commented Sep 12, 2018

Is this ready to hit the mailing list for discussion?

Adds helper functions for the copy->view transition for multi-field
indexes. Adds `structured_to_unstructured`, `apply_along_fields`,
`assign_fields_by_name`, `require_fields`.
@ahaldane ahaldane force-pushed the add_struct_helper_funcs_redo branch from 4297d69 to a868e92 Compare October 31, 2018 21:32
@ahaldane
Copy link
Member Author

Fixed up based on latest comments.

I think some discussion of the situation is still necessary, and it is on the agenda for the next status meeting. In preparation I've written up a draft of the structured array change proposal at https://gist.github.com/ahaldane/6cd44886efb449f9c8d5ea012747323b

Looking over it again now, my feeling is we should go ahead with this PR an all of the proposed changes, despite some back-compat breaks.

@mattip
Copy link
Member

mattip commented Nov 14, 2018

The mailing list / design document met with little dissent. ,Can this go in?

@charris
Copy link
Member

charris commented Nov 14, 2018

Cannot hurt. The only difficulty I see is that folks using it may need to make their code NumPy version dependent. Needs a release note.

@charris charris added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Nov 14, 2018
@charris
Copy link
Member

charris commented Nov 20, 2018

@ahaldane Needs a release note.

@ahaldane
Copy link
Member Author

will do

@ahaldane ahaldane force-pushed the add_struct_helper_funcs_redo branch from a868e92 to 09188b4 Compare November 22, 2018 23:06
@ahaldane ahaldane force-pushed the add_struct_helper_funcs_redo branch from 09188b4 to c892733 Compare November 22, 2018 23:16
@ahaldane ahaldane force-pushed the add_struct_helper_funcs_redo branch from 1564ee8 to 61371de Compare November 23, 2018 21:36
@ahaldane
Copy link
Member Author

All, right, I updated the release note, and also added the new functions to the __array_function__ api like the other functions in the file.

If we merge this, I can submit another PR which implements the multifield copy->view change, since now that #8100 is fixed and with these methods implemented, there should be fewer obstacles to making the change. We can discuss it there.

@charris
Copy link
Member

charris commented Nov 23, 2018

@shoyer Could you check the dispatchers?

@shoyer
Copy link
Member

shoyer commented Nov 24, 2018

Dispatchers look good here

@charris charris merged commit 983bbb5 into numpy:master Nov 24, 2018
@charris
Copy link
Member

charris commented Nov 24, 2018

OK, let's get this in and see how things go,

Apply function 'func' as a reduction across fields of a structured array.

This is similar to `apply_along_axis`, but treats the fields of a
structured array as an extra axis.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this needs a warning that fields are all cast to the same type

"""

if dst.dtype.names is None:
dst[:] = src
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To work on 0d arrays, this needs to be ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

return (array,)

@array_function_dispatch(_require_fields_dispatcher)
def require_fields(array, required_dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name strikes me as a little odd, but I also can't think of a better one.

It might be handy to use the word "require" in the description somewhere, to make the name easier to remember.

@charris
Copy link
Member

charris commented Nov 24, 2018

@ahaldane Want to take care of @eric-wieser comments? Otherwise I'll do it.

@ahaldane
Copy link
Member Author

coming up...

the field datatypes.

Nested fields, as well as each element of any subarray fields, all count
as a single field-elements.
Copy link
Member

@eric-wieser eric-wieser Nov 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if dtype is itself a structured array? Eg, consider:

point = np.dtype([('x'. int), ('y', int)])
triangle = np.dtype([('p_a', point), ('p_b', point), ('p_c', point)]

I'd expect to be able to do

arr = np.zeros(10, triangle)

structured_to_unstructured(arr, dtype=point)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is accounted for, however your particular example shows there is a bug in this code because it can't account for repeated field names in the nested structures. Will fix.

Copy link
Member Author

@ahaldane ahaldane Nov 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second examination, I also missed that your output was structured. structured_to_unstructured doesn't account for that the way you expected, and I'm not sure there is a good "rule" for how it should work. Your example makes sense because each field can be unambiguously cast to the new structured type, so you expect the output shape to be (10, 3) with point dtype. My implementation currently casts all nested fields to the new dtype, resulting in a (10, 6) array of points: It casts the x and y of each point to a point individually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generating "field_{}".format(i) as the name for each field is probably the safest bet - if you control all the names, you don't need to worry about escape sequences.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that's already fixed/implemented in #12446.

I'd rather not attempt to account for structured dtypes in the output though: That's not a pre-existing use-case we're trying to fix, and the best behavior to implement is unclear to me at the moment. Any users who previously did something like that can still do it using repack_fields instead of structured_to_unstructured, though without the added safety the latter has added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not a pre-existing use-case we're trying to fix,

As I understand it, the purpose of structured_to_unstructured is to replace the arr[fields].view(dt) idiom. But it sounds like it doesn't work as a universal replacement:

  • arr[['f_a', 'f_b']].view(float)structured_to_unstructured(arr['f_a', 'f_c'], float)
  • arr[['p_a', 'p_b']].view(point) → ???

Is there a way to spell the second case with these functions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though without the added safety the latter has added.

Can you give an example of that safety, maybe even in the docs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my understanding:

All code of the form arr[['field1', 'field2, ...]].view(dt) in 1.15 can be replaced by repack_fields(arr[['field1', 'field2, ...]]).view(dt) in 1.16, without exceptions. It should be identical performance, since in both cases a copy is made.

Additionally, we have implemented a new function structured_to_unstructured. Although this can't be used as a replacement in all cases as you point out, in the many cases where it can be used it is better because it avoids a copy for multifield-indexes, it is safer since it is memory-layout-agnostic, and it better documents the user's intent. It is safer because it saves the user from bugs when doing the view: If the user tries to do the view themselves it is quite easy to forget to account for padding bytes, endianness, dtype, or misunderstand the memory layout, but structured_to_unstructured is written in a way to guarantee the view is "safe" for any memory layout and dtype (the fields are always viewed with the right offsets). structured_to_unstructured also casts the fields if needed, which the view above does not do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a description based on my last comment in #12447

return (arr,)

@array_function_dispatch(_structured_to_unstructured_dispatcher)
def structured_to_unstructured(arr, dtype=None, copy=False, casting='unsafe'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are places we learnt that unsafe was a bad default, but ended up stuck with it, leaving users surprised by the conversion.

Should we apply that learning here, and pick a more conservative default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have missed that. Maybe in #8733? But that was about assignment using unsafe casting, with no option to specify otherwise, unlike here where there is a keyword the user can specify.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that issue was one of the ones I was thinking of - thanks for linking it, I was looking for that for unrelated reasons too!

Unsafe just feels like an... unsafe default to me. In my opinion, unsafe behavior should be something you ask for, not something you get by default. You're picking between:

  • as is: f(...) vs f(..., casting='safe')
  • proposed: f(..., casting='unsafe') vs f(...)

I'd much rather see the word 'unsafe' to tell me I need to think more carefully about that line of code, rather than having to look for the absence of it.

I don't have a good memory of how the other casting modes behave. I'd be inclined to pick same_kind to match the default value of the casting argument for for ufuncs

Copy link
Member Author

@ahaldane ahaldane Nov 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an argument for unsafe:

First, it matches the default for the same keyword in astype, and so its easier for the user to remember if they are used to using astype.

Second, it seems like most of the time the user wants unsafe, because there are many common casts that are ruled out otherwise. For instance casts from f8 to i8 are disallowed with same_kind, but I expect this is a very common cast.

Actually, for reasons I don't understand, ufuncs seem to allow casts from f8 to i8 even though they supposedly use same_kind:

>>> np.arange(3, dtype='f8').astype('i8', casting='same_kind')                 
TypeError: Cannot cast array from dtype('float64') to dtype('int64') according to the rule 'same_kind'
>>> np.add(np.arange(3, dtype='f8'), np.arange(3, dtype='i8'))                 
array([0., 2., 4.])
>>> np.can_cast('f8', 'i8', casting='same_kind')
False

So ufuncs appear to use unsafe casting despite the keyword default??

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants