Skip to content

ENH: Allow dtype objects to be indexed with multiple fields at once #10417

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jun 7, 2019

Conversation

eric-wieser
Copy link
Member

@eric-wieser eric-wieser commented Jan 17, 2018

This returns a dtype of the same size with other fields removed, just like that used in gh-6053

It should be possible to fall back on this implementation in mapping.c:_get_field_view - would appreciate comments on that.

Some ways in which this differs from array.__getitem__:

  • Empty lists are allowed, since they're not ambiguous vs index arrays
  • Only lists are allowed, which avoids having to deal with malformed sequence objects
  • KeyError is raised for invalid field names instead of ValueError (relates to Incorrect Exception when indexing array with field. #8519)

@mhvk
Copy link
Contributor

mhvk commented Jan 17, 2018

This looks good in itself, but I wonder about indexing based on a list -- it seems to be much clearer to allow dtype['col1', 'col2'] (and tuples are more like keys to a dict, as they are hashable). Is this just by analogy with array indexing? (I don't understand why that isn't allowed in structured arrays either; it seems unambiguous.)

{
int seqlen, i;
PyObject *fields;
PyObject *names;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initialize these to NULL so XDECREF doesn't fail in the first few goto fails.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

@eric-wieser
Copy link
Member Author

it seems to be much clearer to allow dtype['col1', 'col2']

Note that then dtype['col'] and dtype['col',] are a lot less distinguishable than dtype['col'] and dtype[['col']].

@ahaldane
Copy link
Member

Otherwise code looks good. Now for the hard questions:

Will it be confusing if ndarray multifield-indexing removes padding, but dtype-multifield indexing keeps it? I kind of think so, in which case I think both should keep padding.

Is it a good idea to add more common complex behavior to dtypes, given that we eventually want to rewrite dtypes so that each dtype has its own separate code? Eg, only structured dtypes need to know about multifield indexing, it is not needed for dtype(int) for example.
I think my answer is it is OK, since this is pretty clearly for structured dtypes only, and we can just lift it out of the common dtype code the day we split up the types.

I agree that if we add this, _get_field_view can be simplified.

@eric-wieser
Copy link
Member Author

in which case I think both should keep padding.

I retract my earlier comment about that elsewhere, and now agree with you. Being consistent is most useful here.

Is it a good idea to add more common complex behavior to dtypes, given that we eventually want to rewrite dtypes so that each dtype has its own separate code?

Yes, because this is already guarded away inside the HAS_FIELDS check - it would just move under the np.void or possibly a new np.structured class, and remain largely unchanged.

I agree that if we add this, _get_field_view can be simplified.

Are you ok with the exception type changes? Perhaps I should submit a PR to adjust those in _get_field_view first

@ahaldane
Copy link
Member

On the change to KeyError, see #5636 (comment) and discussion there.

I'm for it, but it is a real backcompat issue.

@ahaldane
Copy link
Member

Perhaps we can bundle the KeyError change with the copy->view change if we delay both until 1.15. That way code-breakage is only for one release.

@mhvk
Copy link
Contributor

mhvk commented Jan 17, 2018

I'd hope that anybody who needs a single-item view would write dtype[('col1',)]... It is not something that should determine the logic, I think. Instead, it would seem to be mostly an argument of compatibility with ndarray or with dict keys. And here, compatibility with ndarray probably is more important; I brought it up mostly because it feels unnatural (and our astropy's Table class quite happily indexes with tuples of column names -- but also with lists.

@eric-wieser eric-wieser force-pushed the dtype-lookup-sequence branch from 74532c5 to b596fb0 Compare January 18, 2018 08:21
@eric-wieser
Copy link
Member Author

Updated with _get_field_view implemented in terms of this

@eric-wieser eric-wieser force-pushed the dtype-lookup-sequence branch 2 times, most recently from 2884320 to 23db49c Compare January 18, 2018 09:18
@@ -1157,9 +1157,11 @@ def test_structuredscalar_indexing(self):
def test_multiindex_titles(self):
a = np.zeros(4, dtype=[(('a', 'b'), 'i'), ('c', 'i'), ('d', 'i')])
assert_raises(KeyError, lambda : a[['a','c']])
assert_raises(KeyError, lambda : a[['b','b']])
assert_raises(KeyError, lambda : a[['a','a']])
assert_raises(ValueError, lambda : a[['b','b']]) # field exists, but repeated
Copy link
Member Author

@eric-wieser eric-wieser Jan 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a new error as of 1.14, so we can change it in 1.14.1 without compatibility concerns.

Edit: see #10430 - will rebase this change on top of that

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the second commit, it becomes clearer why this is a nice thing to do! I'd still prefer a tuple, but maybe that is best treated as a separate issue -- for compatibility, we'd need to accept a list anyway.

int seqlen, i;
PyObject *name = NULL, *tup;
PyObject *fields, *names;
PyObject *list_ind;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The errors are because you end up not using list_ind

@eric-wieser eric-wieser force-pushed the dtype-lookup-sequence branch from 23db49c to 040af92 Compare January 18, 2018 17:07
}
}

view_dtype = PyArray_DescrNewFromType(NPY_VOID);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a later commit, but: This should probably just copy the original dtype.

Py_DECREF(names);
return 0;
/* Convert sequences into a list, which is what dtype.__getitem__
expects. */
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I'm arbitrarily deciding that descr.__getitem__ is going to be strict about the sequence type it gets passed, since it has no compatibility concerns. However, we still need to be liberal in what we accept here in array.__getitem__.

@eric-wieser eric-wieser force-pushed the dtype-lookup-sequence branch 2 times, most recently from d2e564b to 1d7e45f Compare January 19, 2018 05:16
@eric-wieser
Copy link
Member Author

eric-wieser commented Jan 19, 2018

Oh heck:

Fatal Python error: ../Objects/tupleobject.c:236 object at 0x7f9744242450 has negative ref count -2604246222170760230

That refcount is (int64_t) (0xdbdbdbdbdbdbdbdb - 1), which is the result of decrefing freed memory...

@eric-wieser eric-wieser force-pushed the dtype-lookup-sequence branch from 1d7e45f to b3c2116 Compare January 19, 2018 06:55
@eric-wieser
Copy link
Member Author

Gave up on the forcing exact lists - it didn't actually help all that much with the refcounting

@mhvk
Copy link
Contributor

mhvk commented Jan 19, 2018

@eric-wieser - I'm confused by your above comment. Is this PR OK, or are there still problems?

@eric-wieser
Copy link
Member Author

There were problems, so I adjusted the interface of the internal function to accept any sequence.

The remark about handing KeyError still needs resolving

@mattip
Copy link
Member

mattip commented Jan 13, 2019

@eric-wieser would you like some help finishing this?

@eric-wieser eric-wieser self-assigned this Jan 13, 2019
@eric-wieser
Copy link
Member Author

Help reminding me that it exists is enough for now - I'll assign it to myself to make it easier to rediscover.

@mattip
Copy link
Member

mattip commented Mar 10, 2019

ping. this still has an approved status but seems to need work

@mhvk mhvk self-requested a review March 10, 2019 15:47
@mhvk
Copy link
Contributor

mhvk commented Mar 10, 2019

I tried the "re-request" review button, but that did not remove my approval. In any case, I think there was little left...

@eric-wieser
Copy link
Member Author

Release note and test of the alignedstruct flag added

@eric-wieser eric-wieser removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label May 11, 2019
@eric-wieser eric-wieser added this to the 1.17.0 release milestone May 11, 2019
@eric-wieser eric-wieser requested a review from seberg May 11, 2019 21:00
Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could add a test for the single "title" case, but the PR does not touch that code, so that does not matter.

decref name if an error occurs further on. */
if (PyTuple_SetItem(names, i, name) < 0) {
goto fail;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyTuple_SET_ITEM is fine (and you can remove the error check).

This returns a dtype of the same size with other fields removed, just like that used in numpygh-6053

It should be possible to fall back on this implementation in mapping.c:_get_field_view
@eric-wieser eric-wieser force-pushed the dtype-lookup-sequence branch from 597cc36 to 90f710b Compare May 11, 2019 22:19
@charris
Copy link
Member

charris commented May 12, 2019

close/reopen

@charris charris closed this May 12, 2019
@charris charris reopened this May 12, 2019
@seberg
Copy link
Member

seberg commented May 12, 2019

Used that "resolve" button, seems somewhat annoying with that marge commit though, ah well, may clean it up tomorrow (or so, or just merge it anyway...).

@eric-wieser
Copy link
Member Author

Bitten by the problems described in #13707 again

@eric-wieser
Copy link
Member Author

@seberg, happy to merge this?

@seberg seberg self-requested a review June 5, 2019 20:46
Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I am about to merge. But before I do: We do not think it would be plausible for this indexing to "squeeze" the fields (remove padding), right? (I mean the dtype one, not the way it is used in indexing of course).

Just want to be sure that there is either little chance of that, or we do not care about changing it later on.

If you feel there is not. Feel free to merge yourself Eric.

/* only happens for strange sequence objects */
npy_bool is_string;
PyObject *item = PySequence_GetItem(ind, i);
if(item == NULL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if(item == NULL) {
if (item == NULL) {

Just nitpicking for no reason.

@eric-wieser
Copy link
Member Author

We do not think it would be plausible for this indexing to "squeeze" the fields (remove padding), right?

No, I think the value comes from this meaning exactly the same thing as it does in indexing arrays, and the change to make array indexing not squeeze fields was quite deliberate.

@seberg
Copy link
Member

seberg commented Jun 7, 2019

OK, I tend to agree, but wasn't into that discussion too deeply. I will just squash and merge this then.

@seberg seberg merged commit 51ee454 into numpy:master Jun 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants