ENH: Allow dtype objects to be indexed with multiple fields at once #10417

eric-wieser · 2018-01-17T08:34:13Z

This returns a dtype of the same size with other fields removed, just like that used in gh-6053.

It should be possible to fall back on this implementation in mapping.c:_get_field_view - would appreciate comments on that.

Some ways in which this differs from array.__getitem__:

- Empty lists are allowed, since they're not ambiguous vs index arrays.
- Only lists are allowed, which avoids having to deal with malformed sequence objects.
- KeyError is raised for invalid field names instead of ValueError (relates to #8519, "Incorrect Exception when indexing array with field").
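
A minimal sketch of the behaviour described above, assuming a NumPy build that includes this change (the thread later puts it on the 1.17.0 milestone). The field names and formats are made up for illustration:

```python
import numpy as np

# Hypothetical packed structured dtype: 'a' at offset 0, 'b' at 4, 'c' at 12.
dt = np.dtype([('a', 'i4'), ('b', 'f8'), ('c', 'u1')])

# Indexing with a list of names keeps the selected fields at their
# original offsets, so the itemsize is unchanged.
sub = dt[['a', 'c']]
print(sub.names)           # ('a', 'c')
print(sub.itemsize)        # 13, same as dt.itemsize
print(sub.fields['c'][1])  # 12, the original offset of 'c'

# An empty list is accepted and yields a dtype of the same size with no fields.
empty = dt[[]]
print(empty.itemsize)      # 13

# An unknown field name raises KeyError rather than ValueError.
try:
    dt[['a', 'missing']]
except KeyError as exc:
    print('KeyError:', exc)
```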

mhvk · 2018-01-17T16:06:11Z

This looks good in itself, but I wonder about indexing based on a list -- it seems to be much clearer to allow dtype['col1', 'col2'] (and tuples are more like keys to a dict, as they are hashable). Is this just by analogy with array indexing? (I don't understand why that isn't allowed in structured arrays either; it seems unambiguous.)

ahaldane · 2018-01-17T16:55:43Z

numpy/core/src/multiarray/descriptor.c

+{
+    int seqlen, i;
+    PyObject *fields;
+    PyObject *names;


Initialize these to NULL so XDECREF doesn't fail in the first few goto fails.

eric-wieser · 2018-01-17T16:57:11Z

> it seems to be much clearer to allow dtype['col1', 'col2']

Note that then dtype['col'] and dtype['col',] are a lot less distinguishable than dtype['col'] and dtype[['col']].
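
To make the distinction concrete, here is a small sketch of what Python passes to `__getitem__` for each spelling. `ShowKey` is a toy class invented for this illustration, not anything in NumPy:

```python
class ShowKey:
    """Toy container (not a NumPy class) that reports the key it receives."""

    def __getitem__(self, key):
        return type(key).__name__, key


probe = ShowKey()
print(probe['col'])           # ('str', 'col')
print(probe['col',])          # ('tuple', ('col',))
print(probe['col1', 'col2'])  # ('tuple', ('col1', 'col2'))
print(probe[['col']])         # ('list', ['col'])
```

The single-field and one-element-tuple forms differ only by a trailing comma in the source, whereas the list form needs an extra pair of brackets.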

ahaldane · 2018-01-17T17:06:54Z

Otherwise code looks good. Now for the hard questions:

Will it be confusing if ndarray multifield-indexing removes padding, but dtype-multifield indexing keeps it? I kind of think so, in which case I think both should keep padding.

Is it a good idea to add more common complex behavior to dtypes, given that we eventually want to rewrite dtypes so that each dtype has its own separate code? E.g., only structured dtypes need to know about multifield indexing; it is not needed for dtype(int), for example.
I think my answer is it is OK, since this is pretty clearly for structured dtypes only, and we can just lift it out of the common dtype code the day we split up the types.

I agree that if we add this, _get_field_view can be simplified.

eric-wieser · 2018-01-17T17:12:29Z

> in which case I think both should keep padding.

I retract my earlier comment about that elsewhere, and now agree with you. Being consistent is most useful here.

> Is it a good idea to add more common complex behavior to dtypes, given that we eventually want to rewrite dtypes so that each dtype has its own separate code?

Yes, because this is already guarded away inside the HAS_FIELDS check - it would just move under the np.void or possibly a new np.structured class, and remain largely unchanged.

> I agree that if we add this, _get_field_view can be simplified.

Are you ok with the exception type changes? Perhaps I should submit a PR to adjust those in _get_field_view first.

ahaldane · 2018-01-17T17:16:06Z

On the change to KeyError, see #5636 (comment) and discussion there.

I'm for it, but it is a real backcompat issue.

ahaldane · 2018-01-17T17:26:50Z

Perhaps we can bundle the KeyError change with the copy->view change if we delay both until 1.15. That way code-breakage is only for one release.

mhvk · 2018-01-17T18:49:26Z

I'd hope that anybody who needs a single-item view would write dtype[('col1',)]... It is not something that should determine the logic, I think. Instead, it would seem to be mostly an argument of compatibility with ndarray or with dict keys. And here, compatibility with ndarray probably is more important; I brought it up mostly because it feels unnatural (and astropy's Table class quite happily indexes with tuples of column names -- but also with lists).

eric-wieser · 2018-01-18T09:13:18Z

Updated with _get_field_view implemented in terms of this

eric-wieser · 2018-01-18T09:18:57Z

numpy/core/tests/test_multiarray.py

@@ -1157,9 +1157,11 @@ def test_structuredscalar_indexing(self):
    def test_multiindex_titles(self):
        a = np.zeros(4, dtype=[(('a', 'b'), 'i'), ('c', 'i'), ('d', 'i')])
        assert_raises(KeyError, lambda : a[['a','c']])
-        assert_raises(KeyError, lambda : a[['b','b']])
+        assert_raises(KeyError, lambda : a[['a','a']])
+        assert_raises(ValueError, lambda : a[['b','b']])  # field exists, but repeated


I think this is a new error as of 1.14, so we can change it in 1.14.1 without compatibility concerns.

Edit: see #10430 - will rebase this change on top of that

mhvk

With the second commit, it becomes clearer why this is a nice thing to do! I'd still prefer a tuple, but maybe that is best treated as a separate issue -- for compatibility, we'd need to accept a list anyway.

mhvk · 2018-01-18T13:33:28Z

numpy/core/src/multiarray/mapping.c

-        int seqlen, i;
-        PyObject *name = NULL, *tup;
-        PyObject *fields, *names;
+        PyObject *list_ind;


The errors are because you end up not using list_ind

eric-wieser · 2018-01-18T17:09:47Z

numpy/core/src/multiarray/descriptor.c

+        }
+    }
+
+    view_dtype = PyArray_DescrNewFromType(NPY_VOID);


For a later commit, but: This should probably just copy the original dtype.

eric-wieser · 2018-01-18T17:12:21Z

numpy/core/src/multiarray/mapping.c

-                Py_DECREF(names);
-                return 0;
+        /* Convert sequences into a list, which is what dtype.__getitem__
+           expects. */


Here I'm arbitrarily deciding that descr.__getitem__ is going to be strict about the sequence type it gets passed, since it has no compatibility concerns. However, we still need to be liberal in what we accept here in array.__getitem__.

eric-wieser · 2018-01-19T05:51:18Z

Oh heck:

Fatal Python error: ../Objects/tupleobject.c:236 object at 0x7f9744242450 has negative ref count -2604246222170760230

That refcount is (int64_t) (0xdbdbdbdbdbdbdbdb - 1), which is the result of decrefing freed memory...
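
The arithmetic can be checked with a short sketch. It assumes the freed block was filled with 0xDB bytes, which is consistent with the number in the error message and, as far as I know, with the fill byte used for freed memory by CPython's debug allocator in the Python versions of that period:

```python
import struct

DEAD_PATTERN = 0xDBDBDBDBDBDBDBDB  # eight 0xDB bytes read as a uint64

# Subtract one (the decref), then reinterpret the same 64 bits as signed.
raw = (DEAD_PATTERN - 1) & 0xFFFFFFFFFFFFFFFF
(as_signed,) = struct.unpack('<q', struct.pack('<Q', raw))
print(as_signed)  # -2604246222170760230
```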

eric-wieser · 2018-01-19T07:18:27Z

Gave up on forcing exact lists; it didn't actually help all that much with the refcounting.

mhvk · 2018-01-19T22:08:15Z

@eric-wieser - I'm confused by your above comment. Is this PR OK, or are there still problems?

eric-wieser · 2018-01-19T22:18:34Z

There were problems, so I adjusted the interface of the internal function to accept any sequence.

The remark about handling KeyError still needs resolving.

mattip · 2019-01-13T08:03:55Z

@eric-wieser would you like some help finishing this?

eric-wieser · 2019-01-13T08:08:57Z

Help reminding me that it exists is enough for now - I'll assign it to myself to make it easier to rediscover.

mattip · 2019-03-10T10:25:05Z

ping. this still has an approved status but seems to need work

mhvk · 2019-03-10T15:48:22Z

I tried the "re-request" review button, but that did not remove my approval. In any case, I think there was little left...

eric-wieser · 2019-05-11T19:05:49Z

Release note and test of the alignedstruct flag added

seberg

Could add a test for the single "title" case, but the PR does not touch that code, so that does not matter.

seberg · 2019-05-11T21:16:14Z

numpy/core/src/multiarray/descriptor.c

+           decref name if an error occurs further on. */
+        if (PyTuple_SetItem(names, i, name) < 0) {
+            goto fail;
+        }


PyTuple_SET_ITEM is fine (and you can remove the error check).

charris · 2019-05-12T03:17:53Z

close/reopen

seberg · 2019-05-12T07:55:25Z

Used that "resolve" button; it seems somewhat annoying with that merge commit though. Ah well, may clean it up tomorrow (or so, or just merge it anyway...).

eric-wieser · 2019-06-05T06:30:46Z

Bitten by the problems described in #13707 again

eric-wieser · 2019-06-05T06:31:38Z

@seberg, happy to merge this?

seberg

LGTM, I am about to merge. But before I do: We do not think it would be plausible for this indexing to "squeeze" the fields (remove padding), right? (I mean the dtype one, not the way it is used in indexing of course).

Just want to be sure that there is either little chance of that, or we do not care about changing it later on.

If you feel there is not, feel free to merge yourself, Eric.

seberg · 2019-06-07T21:54:14Z

numpy/core/src/multiarray/mapping.c

-                /* only happens for strange sequence objects */
+            npy_bool is_string;
+            PyObject *item = PySequence_GetItem(ind, i);
+            if(item == NULL) {


Suggested change

-            if(item == NULL) {
+            if (item == NULL) {

Just nitpicking for no reason.

eric-wieser · 2019-06-07T22:01:21Z

> We do not think it would be plausible for this indexing to "squeeze" the fields (remove padding), right?

No, I think the value comes from this meaning exactly the same thing as it does in indexing arrays, and the change to make array indexing not squeeze fields was quite deliberate.
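
A rough sketch of that consistency, assuming a NumPy version where both this PR and the non-squeezing array indexing are in place (1.17 or later). The dtype is hypothetical, and `repack_fields` from `numpy.lib.recfunctions` is shown only as the explicit way to drop the padding when that is wanted:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical packed dtype: 'a' at offset 0, 'b' at 4, 'c' at 12.
a = np.zeros(3, dtype=[('a', 'i4'), ('b', 'f8'), ('c', 'u1')])

view = a[['a', 'c']]       # multi-field index on the array
sub = a.dtype[['a', 'c']]  # multi-field index on the dtype

print(view.dtype == sub)    # True
print(view.dtype.itemsize)  # 13: the gap left by 'b' is kept

# Removing the padding is an explicit, separate step.
packed = rfn.repack_fields(view)
print(packed.dtype.itemsize)  # 5
```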

seberg · 2019-06-07T22:14:06Z

OK, I tend to agree, but wasn't into that discussion too deeply. I will just squash and merge this then.

eric-wieser added the 01 - Enhancement label Jan 17, 2018

eric-wieser requested a review from ahaldane January 17, 2018 08:34

eric-wieser added the 55 - Needs work label Jan 17, 2018

ahaldane reviewed Jan 17, 2018

eric-wieser force-pushed the dtype-lookup-sequence branch from 74532c5 to b596fb0 Compare January 18, 2018 08:21

eric-wieser force-pushed the dtype-lookup-sequence branch 2 times, most recently from 2884320 to 23db49c Compare January 18, 2018 09:18

eric-wieser commented Jan 18, 2018

mhvk approved these changes Jan 18, 2018

mhvk reviewed Jan 18, 2018

eric-wieser force-pushed the dtype-lookup-sequence branch from 23db49c to 040af92 Compare January 18, 2018 17:07

eric-wieser commented Jan 18, 2018

eric-wieser force-pushed the dtype-lookup-sequence branch 2 times, most recently from d2e564b to 1d7e45f Compare January 19, 2018 05:16

eric-wieser force-pushed the dtype-lookup-sequence branch from 1d7e45f to b3c2116 Compare January 19, 2018 06:55

eric-wieser removed the 55 - Needs work label Jan 19, 2018

eric-wieser self-assigned this Jan 13, 2019

mhvk self-requested a review March 10, 2019 15:47

ahaldane mentioned this pull request Apr 11, 2019

Bug in numpy.copy for views of structured arrays in NumPy 1.16.x #13299

Closed

eric-wieser force-pushed the dtype-lookup-sequence branch from b3c2116 to 33849c1 Compare May 11, 2019 19:05

eric-wieser removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label May 11, 2019

eric-wieser added this to the 1.17.0 release milestone May 11, 2019

eric-wieser requested a review from seberg May 11, 2019 21:00

seberg approved these changes May 11, 2019

eric-wieser added 2 commits May 11, 2019 15:18

ENH: Allow dtype objects to be indexing with multiple fields at once

2104353

This returns a dtype of the same size with other fields removed, just like that used in numpygh-6053 It should be possible to fall back on this implementation in mapping.c:_get_field_view

MAINT: Reuse code from dtype.__getitem__ in mapping.c:_get_field_view

90f710b

eric-wieser force-pushed the dtype-lookup-sequence branch from 597cc36 to 90f710b Compare May 11, 2019 22:19

charris closed this May 12, 2019

charris reopened this May 12, 2019

Merge branch 'master' into dtype-lookup-sequence

32a5702

Merge branch 'master' into dtype-lookup-sequence

85537e9

seberg self-requested a review June 5, 2019 20:46

seberg approved these changes Jun 7, 2019

fixup, add missing whitespace

0f01fd4

seberg merged commit 51ee454 into numpy:master Jun 7, 2019
