Skip to content

BUG: void .item() doesn't hold reference to original array #8157

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 12, 2018

Conversation

ahaldane
Copy link
Member

@ahaldane ahaldane commented Oct 14, 2016

Fixes #8129 by making void.item() return a memoryview which holds a reference to the original array.

Note this has very poor performance (eg in the example in #8129). The problem is that the only way I can see to create a memoryview that holds a reference to the base object is with PyMemoryView_FromObject (and not PyMemoryView_FromMemory or PyMemoryView_FromBuffer). So I am forced to create a size-1 ndarray for every element of the array to pass to PyMemoryView_FromObject and convert it to a memoryview, which is slow.

Is there a better way? If not, hopefully this is a rare enough scenario that we don't need to care too much about performance.

Here is a list of relevant docs, for reference:

@charris
Copy link
Member

charris commented Oct 16, 2016

Hmm, looks like you have researched the problem pretty deeply. Maybe @pv has some ideas.

@charris
Copy link
Member

charris commented Oct 16, 2016

Wow, that issue died as a python issue long ago. Maybe give them a ping.

npy_intp dims[1];
dims[0] = 1;
Py_INCREF(descr);
u = PyArray_NewFromDescr(&PyArray_Type, descr, 1, dims, NULL, ip,
Copy link
Member

@charris charris Oct 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could make this PEP8 compliant. Starting a new line after the opening ( might be the easiest way.

EDIT: Or just declare a flags variable, would make it more understandable...

PyObject *ret, *u;

/* first create a size-1 view of this element */
npy_intp dims[1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a blank line after the declaration. I'd move it up and move the comment down after the declarations.

va = np.zeros(300000, 'V4')
x = va[:1].item()
del va
x.tobytes() # segfault?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always segfaults, or just now and then? Might move the trailing comment onto its own line above.

@charris
Copy link
Member

charris commented Oct 17, 2016

LGTM modulo some nits. I wonder a bit if returning memoryviews could cause backward compatibility problems, although it isn't clear that things were working before. I'm leaning to put this off to the next release so that things have a chance to settle. Maybe we should warn in the release notes?

@ahaldane
Copy link
Member Author

Yeah, I want to think a little more anyway, and I also want to try running pandas/astropy/scipy unit tests with this change.

There might be a workaround to the performance problem through the fact that numpy usually considers unstructured void scalars to be "immutable", so often makes them copies instead of views. So maybe I can avoid the call to PyArray_NewFromDescr by making a cheap copy of the element of some sort. Eg, getting a void-scalar might be cheaper than getting a view.

@homu
Copy link
Contributor

homu commented Nov 25, 2016

☔ The latest upstream changes (presumably #8235) made this pull request unmergeable. Please resolve the merge conflicts.

@ahaldane ahaldane force-pushed the void_item_memview branch from 84bbfc7 to 28b35d6 Compare April 22, 2017 04:50
@ahaldane
Copy link
Member Author

ahaldane commented Apr 22, 2017

Rebased.

Also, realized a simple solution to the problems above: Just return a memoryview of ap. I think this PR is ready now.

Note this PR will have a significant effect on the string representation of V arrays, since V items will now always show up as <memoryview>:

Old str behavior:

>>> np.zeros(6, 'V4')
array([, , , , , ], 
      dtype='|V4')

New behavior:

>>> np.zeros(6, 'V4')
array([<memory at 0x7fd9bd3a6808>, <memory at 0x7fd9bd3a6808>,
       <memory at 0x7fd9bd3a6808>, <memory at 0x7fd9bd3a6808>,
       <memory at 0x7fd9bd3a6808>, <memory at 0x7fd9bd3a6808>],
      dtype='|V4')

I suggest that the new behavior is more correct, although wordier. This way we don't risk cluttering the terminal with invalid characters, and we don't end up with mystery spaces like in the old representation.

Edit: Although... the memory addresses printed are a bit misleading. I think they are the addresses of the temporary void scalars created in array2string. Void scalars are always copies of the array elements, so the memory locations are not those of the original array. Probably the memory locations are sometimes repeated because one temporary scalar is desctroyed before the next is created. I could hide the addresses by defining voidtype_repr, in this PR or a later one...

@eric-wieser
Copy link
Member

Does the following work fine under this patch?

dt = np.dtype([('i', int), ('v', 'V0')])
np.zeros(6, dt)

@ahaldane
Copy link
Member Author

ahaldane commented Apr 22, 2017

Your example is just printed differently:

>>> dt = np.dtype([('i', int), ('v', 'V0')])
>>> np.zeros(6, dt)
array([(0, <memory at 0x7ffa3f09d240>), (0, <memory at 0x7ffa3f09d240>),
       (0, <memory at 0x7ffa3f09d240>), (0, <memory at 0x7ffa3f09d240>),
       (0, <memory at 0x7ffa3f09d240>), (0, <memory at 0x7ffa41accb70>)],
      dtype=[('i', '<i8'), ('v', 'V')])
>>> m = _[0].item()[1]
>>> m.format, m.itemsize
('0x', 0)

(also, this example uses a structured type, but this patch mainly affects unstructured void types)

@eric-wieser
Copy link
Member

My concern was handling of V0, but it seems it was without reason.

Sure looks like a bug in python, the way that memoryview is being shown - shouldn't that show the address of the memory it points to, not the address of the pyobject holding it?

@eric-wieser
Copy link
Member

Also, I don't agree with your old behaviour - on python 3, I get:

>>> np.zeros(6, 'V4')
array([[0 0 0 0], [0 0 0 0], [0 0 0 0], [0 0 0 0], [0 0 0 0], [0 0 0 0]], 
      dtype='|V4')

Are you using 2.7?

@ahaldane
Copy link
Member Author

It is showing the address being pointed to.

The problem is that during the process of printing the array, numpy makes a temporary copy of the memory and shows that instead.

More in detail, numpy goes through each element, and makes a void scalar from it, and unstructured void scalars (unlike structured voids) always make a copy (not a view) of the array element they correspond to. So the printed memory refer's to an unstructured void scalar's buffer.

We might fix that (in another PR?) by making unstructured void scalars point to the original array's memory (be views). That would also fix the problem that writing to an unstructured void scalar's memory currently doesn't affect the original array, unlike structured void scalars.

The place where this all happens is in PyArray_Scalar.

@ahaldane
Copy link
Member Author

I'm using Python 3.6.0 in the printout above.... curious yours is different.

@eric-wieser
Copy link
Member

eric-wieser commented Apr 22, 2017

It is showing the address being pointed to.

Can you prove that to me with:

arr = np.zeros(6, 'V4')
x = arr[0]
print(repr(x))
print(object.__repr__(x))

I see your point though, that in our case, neither address is at all useful.

Since void is returned by copy, I think it's very important that the repr reflects this - right now, its very indicative of a view

@ahaldane
Copy link
Member Author

>>> arr = np.zeros(6, 'V4')
...: x = arr[0]
...: print(repr(x))
...: print(object.__repr__(x))
...: 
<memory at 0x7f22d876b240>
<numpy.void object at 0x7f22d8790570>

@ahaldane
Copy link
Member Author

Ah, fair enough, done.

@@ -416,6 +416,10 @@ The printing style of ``np.void`` arrays is now independently customizable
using the ``formatter`` argument to ``np.set_printoptions``, using the
``'void'`` key, instead of the catch-all ``numpystr`` key as before.

unstructured void array's ``.item`` method now returns a bytes object
---------------------------------------------------------------------
``.item`` now returns a ``bytes`` object instead of a buffer or byte array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably worth drawing attention to the fact that this new object is readonly, unlike the previous buffer or byte array that was read/write

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For that reason, this maybe should be under Compatibility

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, the previous buffer or byte array wasn't a view or the original array or scalar. So writing to it had no effect. I'll add a note though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, but it means that users who expected to get a mutable scratch buffer no longer do so.

I suppose the alternative would be to return a bytearray object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered that too. But I suspect we would actually be causing bugs/confusion if we return a bytearray, since it vaguely implies it is a view when it isn't. Returning an immutable bytes makes clear you only got a copy.

Updated the note, by the way.

Copy link
Member

@eric-wieser eric-wieser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I'd like someone else's opinion on whether changing the mutability is too backwards-incompatible.

@charris
Copy link
Member

charris commented Nov 14, 2017

I have no idea if the change in mutability will have much impact. Probably the only way to find out is to try. I suppose we could issue a FutureWarning first.

@ahaldane
Copy link
Member Author

ahaldane commented Nov 14, 2017

I doubt it will have much impact, since the old behavior was almost totally non-functional:

void.item() gave back different types objects in python 2 vs python 3 (memoryview vs array), easily caused segfaults, and, confusingly, appeared to be a view (since it was mutable) but was actually just a mutable copy, thus making it usually pointless to modify.

Edit: that last part isn't completely true... in python 2 the returned array was indeed a view if you do arr[:1].item()... but was a copy if you did arr[0].item().

@eric-wieser
Copy link
Member

So, the choice comes down to picking between

  • Return bytes, which could break old code using np.void as a scratch pad, but accurately indicates that the object is not a view
  • Return bytearray, which is (fully?) compatible with old code, but gives the illusion that it might be a view (just like it always did)

Either of these is better than leaving things how they are.

@eric-wieser
Copy link
Member

eric-wieser commented Nov 17, 2017

As an aside, this change will make int(np.array(b'10', np.void)) work, for better or worse

@ahaldane
Copy link
Member Author

If there is any doubt, then it's probably safest to issue a DeprecationWarning for 1.14 and leave the behavior alone, and then do the actual change to bytes in 1.15. There are definitely times I wish I had been more cautious and warned first in the past!

I just tried adding the warning, and running runtests.py --raise-warnings develop, and I don't get any warnings, so adding a warning doesn't seem too noisy.

@eric-wieser
Copy link
Member

I don't see a good way to give a silenceable warning here

@charris
Copy link
Member

charris commented Nov 25, 2017

I pushed this off to 1.15, will go with the warning first.

@charris
Copy link
Member

charris commented Apr 9, 2018

@ahaldane Needs rebase. I'm looking to branch 1.15 pretty soon, should we push this off to the next release to allow the FutureWarning more time or just go ahead. I had the impression that this change was not expected to cause problems.

@ahaldane ahaldane closed this Apr 13, 2018
@ahaldane ahaldane force-pushed the void_item_memview branch from 982ed77 to 3ec8875 Compare April 13, 2018 04:11
@ahaldane
Copy link
Member Author

whoops, what happened here?

@ahaldane ahaldane reopened this Apr 13, 2018
@eric-wieser
Copy link
Member

@charris: To be clear, you want this to target 1.16?

@ahaldane
Copy link
Member Author

ahaldane commented May 1, 2018

I think @charris was asking me to rebase for 1.15, which I did. So this can stay tagged for 1.15.

@charris
Copy link
Member

charris commented May 11, 2018

OK, I'm looking to put this in. @ahaldane @eric-wieser Is that OK with you?

@charris charris merged commit 88e5cb7 into numpy:master May 12, 2018
@charris
Copy link
Member

charris commented May 12, 2018

Thanks Allan.

@ahaldane
Copy link
Member Author

Thanks Chuck! Yes it was OK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants