ENH: add multi-field assignment helpers in np.lib.recfunctions #11526

ahaldane · 2018-07-07T20:03:28Z

Adds helper functions for the copy->view transition for multi-field indexes. Adds structured_to_unstructured, apply_along_fields, assign_fields_by_name, require_fields.

This is based on the feedback from discussions on the mailing list and on GitHub about cases where users wanted the "old" behavior. See #6053 (comment), and this mailing list thread: http://numpy-discussion.10968.n7.nabble.com/Setting-custom-dtypes-and-1-14-tt45156.html#a45207. Note the reply there arguing we should skip apply_along_fields; maybe we should remove that one.

Don't merge yet. I'd like to let this sit a while in case there are more ideas for useful functions. Putting it up now so everyone can get a sense of what is being proposed.

charris · 2018-07-07T20:48:38Z

If we are going to support both, would it not be more compatible to add new functions for views and keep the old behavior by default?

eric-wieser · 2018-07-08T14:02:26Z

numpy/lib/recfunctions.py

+    n_elem = sum(f[2] for f in fields)
+
+    if dtype is None:
+        out_dtype = np.result_type(*[f[1].base for f in fields])


How did you choose 'result_type' over 'common_type'? It would be nice to justify the choice, even if the rationale is arbitrary.

common_type will convert a dtype with all-integer fields to float, which doesn't seem right. result_type preserves the int type in that case.
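The difference described in this reply can be checked with a minimal sketch. The two integer dtypes below are made up for illustration and are not taken from the PR.

```python
import numpy as np

# Two integer dtypes, standing in for the fields of a structured array.
int_dtypes = [np.dtype('i4'), np.dtype('i8')]

# result_type applies the usual promotion rules and stays integer.
promoted = np.result_type(*int_dtypes)

# common_type operates on arrays and always returns an inexact (float) type.
common = np.common_type(*[np.zeros(1, dtype=dt) for dt in int_dtypes])

print(promoted)  # int64
print(common)    # <class 'numpy.float64'>
```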

eric-wieser · 2018-07-08T14:20:24Z

numpy/lib/recfunctions.py

+            fields.extend(_get_fields_and_offsets(field[0], field[1] + offset))
+    return fields
+
+def structured_to_unstructured(arr, dtype=None):


I'd prefer an implementation that just does arr.astype(replace_all_fields(arr.dtype, dtype)).view(dtype), since then it could take copy and casting parameters to forward to astype

Good idea, done.

As a bonus, the function now returns a view instead of a copy if all the fields have the same dtype, which is the use-case people said they used on the mailing list.
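A small sketch of the same-dtype case, assuming a NumPy version that ships structured_to_unstructured (1.16 or later). The array and its field names are hypothetical.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical array whose fields all share one dtype (float64).
arr = np.zeros(3, dtype=[('x', 'f8'), ('y', 'f8')])
arr['x'] = [1.0, 2.0, 3.0]

# Fields become a trailing axis: shape (3, 2), dtype float64.
flat = rfn.structured_to_unstructured(arr)

# With copy=False (the default) no conversion is needed here, so the result
# is expected to alias the input; np.shares_memory reports whether it does.
is_view = np.shares_memory(arr, flat)

# copy=True forces an independent array.
flat_copy = rfn.structured_to_unstructured(arr, copy=True)
```

Checking is_view rather than assuming it is the safer habit, since whether a view is possible depends on the dtype layout.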

ahaldane · 2018-07-08T16:53:51Z

@charris, that's still a possibility. I definitely don't mean to push these structured array changes without some support, or at least passive agreement, from the numpy developer community, especially if there is a backcompat break.

#6053 actually makes two, somewhat independent, changes: 1. It makes multifield indexing return a view, and 2. it makes multifield assignment work "by position". In this PR the methods repack_fields and (un)structured_to(un)structured are to help transition for 1, and assign_fields_by_name and require_fields are for 2. I think these changes help fix a lot of bugs, inconsistencies, and confusing behavior.
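To make change 1 concrete, here is a minimal sketch of what "multifield indexing returns a view" means in practice. It assumes NumPy 1.16 or later, where that behavior is in effect; the field names are made up.

```python
import numpy as np

a = np.zeros(3, dtype=[('x', 'i4'), ('y', 'f8'), ('z', 'i4')])

# Multi-field index: under the view behavior, the result keeps the original
# offsets and itemsize, leaving a gap where 'y' used to be.
sub = a[['x', 'z']]
same_itemsize = sub.dtype.itemsize == a.dtype.itemsize
aliases = np.shares_memory(a, sub)

# Writing through the view modifies the original array.
sub['x'] = 7
```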

We can definitely go back on either or both of those. Let me try exploring that in another PR before finishing this one.

charris · 2018-07-08T20:03:34Z

Just to be clear, numpy 1.15 will still be backward compatible? Or, put differently, if we broke backward compatibility, when did we do it?

ahaldane · 2018-07-08T23:00:11Z

Remember there are two behavior changes:

For the "multifield indexing returns a view" behavior, we put it off until 1.16, so nothing is broken yet.

For the "copy-by-position" behavior, we already broke back-compatibility in 1.14.

charris · 2018-07-16T20:38:17Z

@ahaldane Does this still need consideration for 1.15.x?

charris · 2018-08-17T03:39:49Z

@ahaldane What is the status of all this stuff?

ahaldane · 2018-08-17T21:28:04Z

Sorry, I've been out/busy! I'm hoping to be more active soon...

I think the status is that there is still some hesitation about the behavior changes I wanted for structured arrays. There are two alternative proposals:

1. Keep all the new numpy 1.15 behavior (already released), which makes all struct assignment "assign by position", and add these helper functions to help users like the one in #6053 (comment), whose code I broke.
2. Choose the alternate option of using "assign by name" for all struct assignment (implemented in #11530). This is in response to #6053 (comment) and #10789 ("BUG: Array copy does not zero out gaps in structured dtypes"), where a user has code which I broke.

Here is the central reason for hesitation: both options 1 and 2 break some compatibility with 1.14, because 1.14 was inconsistent with itself, so any fix would have to break something. However, it may be the case that we break fewer people's code with option 2. On the other hand, I find option 1 much more elegant, and I think it gives more reasonable future behavior.
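The two assignment semantics under discussion can be contrasted with a short sketch. It assumes a NumPy release that includes assign_fields_by_name; the arrays and field names are hypothetical.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Same field names in both dtypes, but in a different order.
dst = np.zeros(2, dtype=[('a', 'i8'), ('b', 'i8')])
src = np.array([(1, 2), (3, 4)], dtype=[('b', 'i8'), ('a', 'i8')])

# "Assign by position": the first field of src lands in the first field of dst.
by_position = dst.copy()
by_position[...] = src

# "Assign by name": each field of dst is filled from the src field with the same name.
by_name = dst.copy()
rfn.assign_fields_by_name(by_name, src)

print(by_position['a'])  # [1 3]
print(by_name['a'])      # [2 4]
```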

mattip · 2018-09-12T11:54:37Z

Is this ready to hit the mailing list for discussion?

ahaldane · 2018-10-31T21:37:51Z

Fixed up based on latest comments.

I think some discussion of the situation is still necessary, and it is on the agenda for the next status meeting. In preparation I've written up a draft of the structured array change proposal at https://gist.github.com/ahaldane/6cd44886efb449f9c8d5ea012747323b

Looking over it again now, my feeling is we should go ahead with this PR and all of the proposed changes, despite some back-compat breaks.

mattip · 2018-11-14T22:33:43Z

The mailing list / design document met with little dissent. Can this go in?

charris · 2018-11-14T23:16:42Z

Cannot hurt. The only difficulty I see is that folks using it may need to make their code NumPy version dependent. Needs a release note.

charris · 2018-11-20T17:13:56Z

@ahaldane Needs a release note.

ahaldane · 2018-11-20T17:59:05Z

will do

ahaldane · 2018-11-23T22:36:24Z

All right, I updated the release note, and also added the new functions to the __array_function__ API like the other functions in the file.

If we merge this, I can submit another PR which implements the multifield copy->view change, since now that #8100 is fixed and with these methods implemented, there should be fewer obstacles to making the change. We can discuss it there.

charris · 2018-11-23T23:54:58Z

@shoyer Could you check the dispatchers?

shoyer · 2018-11-24T01:04:41Z

Dispatchers look good here

charris · 2018-11-24T01:09:26Z

OK, let's get this in and see how things go.

eric-wieser · 2018-11-24T01:41:52Z

numpy/lib/recfunctions.py

+    Apply function 'func' as a reduction across fields of a structured array.
+
+    This is similar to `apply_along_axis`, but treats the fields of a
+    structured array as an extra axis.


I think that this needs a warning that fields are all cast to the same type
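A minimal sketch of the casting this comment refers to, assuming a NumPy version that provides apply_along_fields. The data is made up.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Mixed field types: an int32 field and a float64 field.
b = np.array([(1, 2.5), (3, 4.5)], dtype=[('x', 'i4'), ('y', 'f8')])

# The fields are first converted to one common dtype (float64 here),
# then the reduction runs across that extra "field" axis.
totals = rfn.apply_along_fields(np.sum, b)

print(totals)        # [3.5 7.5]
print(totals.dtype)  # float64
```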

eric-wieser · 2018-11-24T01:43:49Z

numpy/lib/recfunctions.py

+    """
+
+    if dst.dtype.names is None:
+        dst[:] = src


To work on 0d arrays, this needs to be dst[...] = src (an Ellipsis index rather than a slice).

Good point.
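The 0d issue is easy to reproduce in isolation. The sketch below uses a plain float array rather than the PR's code, purely for illustration.

```python
import numpy as np

scalar_arr = np.zeros(())  # 0-d array, shape ()

# A slice needs at least one axis to index, so it fails on a 0-d array.
try:
    scalar_arr[:] = 1.0
    slice_ok = True
except IndexError:
    slice_ok = False

# An Ellipsis index works for any number of dimensions, including zero.
scalar_arr[...] = 2.0
```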

eric-wieser · 2018-11-24T01:46:16Z

numpy/lib/recfunctions.py

+    return (array,)
+
+@array_function_dispatch(_require_fields_dispatcher)
+def require_fields(array, required_dtype):


This name strikes me as a little odd, but I also can't think of a better one.

It might be handy to use the word "require" in the description somewhere, to make the name easier to remember.
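For readers unfamiliar with the function under discussion, a short usage sketch, assuming a NumPy version that ships require_fields. The dtype and the field name 'newf' are made up for illustration.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.ones(4, dtype=[('a', 'i4'), ('b', 'f8'), ('c', 'u1')])

# Ask for an array that has exactly the required fields: 'b' is copied
# (and cast to float32), 'newf' does not exist in `a` and is zero-filled,
# and the fields 'a' and 'c' are dropped.
out = rfn.require_fields(a, [('b', 'f4'), ('newf', 'u1')])

print(out.dtype.names)  # ('b', 'newf')
```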

charris · 2018-11-24T19:15:51Z

@ahaldane Want to take care of @eric-wieser comments? Otherwise I'll do it.

ahaldane · 2018-11-24T22:12:25Z

coming up...

eric-wieser · 2018-11-25T00:31:35Z

numpy/lib/recfunctions.py

+    the field datatypes.
+
+    Nested fields, as well as each element of any subarray fields, all count
+    as a single field-elements.


What happens if dtype is itself a structured array? Eg, consider:

point = np.dtype([('x', int), ('y', int)])
triangle = np.dtype([('p_a', point), ('p_b', point), ('p_c', point)])

I'd expect to be able to do

arr = np.zeros(10, triangle)
structured_to_unstructured(arr, dtype=point)

This is accounted for; however, your particular example shows there is a bug in this code, because it can't account for repeated field names in the nested structures. Will fix.

On second examination, I also missed that your output was structured. structured_to_unstructured doesn't account for that the way you expected, and I'm not sure there is a good "rule" for how it should work. Your example makes sense because each field can be unambiguously cast to the new structured type, so you expect the output shape to be (10, 3) with point dtype. My implementation currently casts all nested fields to the new dtype, resulting in a (10, 6) array of points: It casts the x and y of each point to a point individually.

Generating "field_{}".format(i) as the name for each field is probably the safest bet - if you control all the names, you don't need to worry about escape sequences.

Yup, that's already fixed/implemented in #12446.

I'd rather not attempt to account for structured dtypes in the output though: That's not a pre-existing use-case we're trying to fix, and the best behavior to implement is unclear to me at the moment. Any users who previously did something like that can still do it using repack_fields instead of structured_to_unstructured, though without the added safety the latter has added.

> That's not a pre-existing use-case we're trying to fix,

As I understand it, the purpose of structured_to_unstructured is to replace the arr[fields].view(dt) idiom. But it sounds like it doesn't work as a universal replacement:

arr[['f_a', 'f_b']].view(float) → structured_to_unstructured(arr[['f_a', 'f_b']], float)

arr[['p_a', 'p_b']].view(point) → ???

Is there a way to spell the second case with these functions?

> though without the added safety the latter has added.

Can you give an example of that safety, maybe even in the docs?

Here's my understanding:

All code of the form arr[['field1', 'field2', ...]].view(dt) in 1.15 can be replaced by repack_fields(arr[['field1', 'field2', ...]]).view(dt) in 1.16, without exceptions. It should be identical performance, since in both cases a copy is made.
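The replacement idiom can be sketched as follows, assuming NumPy 1.16 or later. The array, field names, and final reshape are illustrative choices, not part of the PR.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

arr = np.zeros(3, dtype=[('a', 'f8'), ('b', 'i4'), ('c', 'f8')])
arr['a'] = [1, 2, 3]
arr['c'] = [10, 20, 30]

# repack_fields copies the selected fields into a packed layout without gaps,
# so the bytes line up with a plain float64 view.
packed = rfn.repack_fields(arr[['a', 'c']])

# Viewing the packed 16-byte records as float64 yields 6 values; reshape to rows.
as_float = packed.view(np.float64).reshape(-1, 2)
```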

Additionally, we have implemented a new function structured_to_unstructured. Although this can't be used as a replacement in all cases as you point out, in the many cases where it can be used it is better because it avoids a copy for multifield-indexes, it is safer since it is memory-layout-agnostic, and it better documents the user's intent. It is safer because it saves the user from bugs when doing the view: If the user tries to do the view themselves it is quite easy to forget to account for padding bytes, endianness, dtype, or misunderstand the memory layout, but structured_to_unstructured is written in a way to guarantee the view is "safe" for any memory layout and dtype (the fields are always viewed with the right offsets). structured_to_unstructured also casts the fields if needed, which the view above does not do.
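As a rough illustration of the layout-agnostic behavior described here, the sketch below applies structured_to_unstructured to a multi-field index with mixed field types. It assumes NumPy 1.16 or later; the fields are hypothetical.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

arr = np.zeros(3, dtype=[('a', 'f8'), ('b', 'i4'), ('c', 'f4')])
arr['b'] = [1, 2, 3]
arr['c'] = [0.5, 1.5, 2.5]

# The multi-field index keeps the gap left by 'a', and 'b' and 'c' have
# different types. The function reads each field at its own offset and
# casts to a common dtype (float64 for int32 + float32).
out = rfn.structured_to_unstructured(arr[['b', 'c']])

print(out.dtype)  # float64
```

A hand-written .view(float) on the same selection would have to account for the gap and the differing field widths, which is the class of bug the comment describes.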

I added a description based on my last comment in #12447

eric-wieser · 2018-11-25T00:32:56Z

numpy/lib/recfunctions.py

+    return (arr,)
+
+@array_function_dispatch(_structured_to_unstructured_dispatcher)
+def structured_to_unstructured(arr, dtype=None, copy=False, casting='unsafe'):


I think there are places we learnt that unsafe was a bad default, but ended up stuck with it, leaving users surprised by the conversion.

Should we apply that learning here, and pick a more conservative default?

I may have missed that. Maybe in #8733? But that was about assignment using unsafe casting, with no option to specify otherwise, unlike here where there is a keyword the user can specify.

Yeah, that issue was one of the ones I was thinking of - thanks for linking it, I was looking for that for unrelated reasons too!

Unsafe just feels like an... unsafe default to me. In my opinion, unsafe behavior should be something you ask for, not something you get by default. You're picking between:

as is: f(...) vs f(..., casting='safe')

proposed: f(..., casting='unsafe') vs f(...)

I'd much rather see the word 'unsafe' to tell me I need to think more carefully about that line of code, rather than having to look for the absence of it.

I don't have a good memory of how the other casting modes behave. I'd be inclined to pick same_kind to match the default value of the casting argument for ufuncs.
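The trade-off being debated can be seen in a small sketch that uses the merged signature (casting='unsafe' by default). The data is made up, and the example assumes NumPy 1.16 or later.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

arr = np.array([(1.5, 2.5)], dtype=[('x', 'f8'), ('y', 'f8')])

# Default casting='unsafe': the float -> int conversion silently truncates.
lossy = rfn.structured_to_unstructured(arr, dtype=np.int64)

# casting='safe' refuses the same conversion instead of losing data.
try:
    rfn.structured_to_unstructured(arr, dtype=np.int64, casting='safe')
    safe_ok = True
except TypeError:
    safe_ok = False

print(lossy)  # [[1 2]]
```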

Here's an argument for unsafe:

First, it matches the default for the same keyword in astype, and so it's easier for the user to remember if they are used to using astype.

Second, it seems like most of the time the user wants unsafe, because there are many common casts that are ruled out otherwise. For instance casts from f8 to i8 are disallowed with same_kind, but I expect this is a very common cast.

Actually, for reasons I don't understand, ufuncs seem to allow casts from f8 to i8 even though they supposedly use same_kind:

>>> np.arange(3, dtype='f8').astype('i8', casting='same_kind')
TypeError: Cannot cast array from dtype('float64') to dtype('int64') according to the rule 'same_kind'
>>> np.add(np.arange(3, dtype='f8'), np.arange(3, dtype='i8'))
array([0., 2., 4.])
>>> np.can_cast('f8', 'i8', casting='same_kind')
False

So ufuncs appear to use unsafe casting despite the keyword default??
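One possible reading of the session above, offered as an editorial sketch rather than a conclusion reached in the thread: np.add computes in the promoted type float64, so the only cast it needs is int64 to float64, which same_kind allows. A float64 to int64 cast is only required when an integer output is requested, and that is rejected.

```python
import numpy as np

f = np.arange(3, dtype='f8')
i = np.arange(3, dtype='i8')

# The ufunc operates in the promoted type, float64.
out_dtype = np.result_type(np.float64, np.int64)

# Only the int64 -> float64 direction is needed, and same_kind permits it.
int_to_float = np.can_cast(np.int64, np.float64, casting='same_kind')
float_to_int = np.can_cast(np.float64, np.int64, casting='same_kind')

# Forcing an int64 output does require float64 -> int64, so it fails.
try:
    np.add(f, i, out=np.empty(3, dtype='i8'))
    int_out_ok = True
except TypeError:
    int_out_ok = False
```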

ahaldane added the "01 - Enhancement" and "component: numpy.lib" labels Jul 7, 2018

ahaldane added this to the 1.16.0 release milestone Jul 7, 2018

ahaldane mentioned this pull request Jul 7, 2018

MAINT: struct assignment "by field position", multi-field indices return views #6053

Merged

charris added the "09 - Backport-Candidate" label Jul 7, 2018

eric-wieser reviewed Jul 8, 2018

eric-wieser reviewed Jul 8, 2018

ahaldane mentioned this pull request Jul 8, 2018

MAINT: cancel multifield copy->view and copy-by-position plans (alternate proposal) #11530

Closed

mattip added the component: numpy.dtype label Jul 8, 2018

charris removed the "09 - Backport-Candidate" label Aug 17, 2018

ENH: add multi-field assignment helpers in np.lib.recfunctions

f1fba70

ahaldane force-pushed the add_struct_helper_funcs_redo branch from 4297d69 to a868e92 Compare October 31, 2018 21:32

charris added the "56 - Needs Release Note" label Nov 14, 2018

ahaldane force-pushed the add_struct_helper_funcs_redo branch from a868e92 to 09188b4 Compare November 22, 2018 23:06

ENH: Fixups to multi-field assignment helpers

c892733

ahaldane force-pushed the add_struct_helper_funcs_redo branch from 09188b4 to c892733 Compare November 22, 2018 23:16

MAINT: Add new recfunctions to numpy function API

61371de

ahaldane force-pushed the add_struct_helper_funcs_redo branch from 1564ee8 to 61371de Compare November 23, 2018 21:36

charris merged commit 983bbb5 into numpy:master Nov 24, 2018

eric-wieser reviewed Nov 24, 2018

This was referenced Nov 24, 2018

MAINT: Fixups to new functions in np.lib.recfunctions #12446

Merged

ENH: add back the multifield copy->view change #12447

Merged

eric-wieser reviewed Nov 25, 2018

eric-wieser mentioned this pull request Apr 15, 2019

Weird behavior of structured_to_unstructured on non-trivial dtypes #13333

Closed

mattip mentioned this pull request Jul 2, 2019

DOC, ENH: add new recfunctions to __all__ #13889

Closed
