-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: Make it possible to call .view on object arrays #8514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Not sure what exactly is in it, but you might want to check #5508 before continuing. |
Ah, sorry, nvm. I remembered there was sometihng about relaxing view, but that one is not about objects. |
This is simply aiming to change |
35f82ef
to
f0ecabb
Compare
@@ -355,15 +355,37 @@ def _view_is_safe(oldtype, newtype): | |||
If the new type is incompatible with the old type. | |||
|
|||
""" | |||
from numpy.lib.type_check import find_dtype_offsets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This import feels kinda hacky, but it can't go globally. Should I put this function somewhere else?
oldoffsets = find_dtype_offsets((oldtype, oldrepeats), object_) | ||
|
||
# everything matches - we're good | ||
if len(newoffsets) == len(oldoffsets) and all(newoffsets == oldoffsets): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better way of doing this check? Converting to list
?
7bc42b9
to
a558e1a
Compare
Useful for finding the location of custom dtype objects, or np.object_ members to improve the behaviour of .view
a558e1a
to
93ba54b
Compare
…n objects This makes unique work on object arrays
93ba54b
to
abff3ab
Compare
You can also look at #5548, which was an attempt to do something similar. However it was largely reverted in #6562 because it turned out to be too big a performance hit to look up all the object positions, the way I implemented it. I welcome an attempt to fix this, btw. In #5548, the methods There are also tests there I wrote which may be useful here. |
What even is a "performance hit" when the result previous to that patch was an exception? Is it simply the lack of the |
The problem was due to "hidden" objects, as I discussed with an example in the first comment in #5548. The possibility of hidden objects means I needed to do a computationally expensive overlap check even if the As I vaguely recall, when the performance problem became a big enough issue I realized that the other changes I had made over the course of those PRs actually fixed most of the issues I had originally indended to fix, without needing object views. Therefore I was perfectly happy to simply disable object views and remove the safety checks. I think it would be great to re-enable object views, though. I think it would fix a number of open issues. It's very likely there are fixes to the hidden object problems I encountered which I didn't explore properly, and some fresh eyes on the problem are very welcome. Maybe a solution is to make the HASOBJECTS flag better reflect whether the array contains objects....? I'm not sure, because that would probably also mess with the reference counting. |
So right now, this is a somewhat more restrictive version of your PR, as it doesn't allow object members to be hidden in a view, whereas yours just stopped members being reinserted - but as a result, presumably comes at less of a performance cost. Your issue was the extra time in executing |
That's true, a more restrictive version could avoid the problems. And yes, the performance cost was in Also, I think a perfomance hit is acceptable as long as it only affects object arrays. My problem was it affected non-object structured arrays too. |
@ahaldane I'll go ahead and copy the tests from that pull request then. Would you prefer me to make |
Such as find_dtype_offsets( (float, (1024, 1024, 1024), object )
8678575
to
eb6d0e5
Compare
Adds back tests for object array views. Marks the ones that deliberately fail with knownfailif, for now
@ahaldane: Ok, your tests are copied across |
I don't have a strong opinion about where I needed the non-object fields to do byte-overlap checks; if you don't need them it sounds good to optimize them out. |
@ahaldane: So do you think this needs further changes? |
Oh I didn't realize it was finished.. I'm taking a look now. |
return | ||
# no object members is fine | ||
if not newtype.hasobject and not oldtype.hasobject: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the part I am still thinking about, becuase of the problem of "hidden objects".
It's true that currently, I don't think there is a way to create hidden objects because most operations involving objects are disallowed or make copies, but we want to relax those restrictions.
We will have a problem when trying to solve #5994, as I attempt in my PR #6053. That is, once we allow multi-field views like:
>>> a = np.zeros(5, dtype='O,O,O')
>>> b = a[['f0','f2']]
>>> b.dtype
dtype({'names':['f0','f2'], 'formats':['O','O'], 'offsets':[0,16], 'itemsize':24})
then b will have a hidden object at byte 8. That will be true of any implementation of #5994, besides #6053. (Currently the second line returns a copy instead of a view, thus avoiding hidden objects, but that is what we want to change).
I am trying to figure out the easiest way to get both this PR and multi-field views to interact safely. (this may involve setting limitations on the plans in #5994).
Note how this PR would allow this view, using b
from my example above:
>>> b.view('O,p,O')
>>> b['f1'] = 0 # segfault?
Perhaps the solution is to disallow multi-field views if it would create hidden objects? That also seems a little wonky. I need to think a bit more about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you meant to comment on this line, because it matches the old behaviour here.
I think I see the problem now. It doesn't exist yet, right, but would be a consequence of allowing partial views.
But as it stands, this pr doesn't allow the creation of those partial views in the first place. I believe this pr allows a strict subset of the things a complete partial-view implementation would allow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I guess the result would be that all this gets thrown out when partial views do make it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the best solution is just to store a reference to the original dtype for partial views, or simply a list of hidden object offsets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this:
First, I think we only need some relatively minor updates (in a future PR, when multi-field views are enabled) so that the hasobject
field is guaranteed to be True
if the dtype has objects including hidden objects. Thus, whenever taking a view of an array with objects, the view dtype will have hasobject=true
even if the objects are hidden.
This is actually often the case as implemented in #6053 - with c = np.zeros(4, dtype="p,O,p")[['f0','f2']]
, I see that c.dtype.hasobject
is True
. However, I'm pretty sure I saw other cases where I can get hasobject
to be false if there are hidden objects. Those would need to be fixed.
Then, this PR also needs a modification: The view should only be allowed if all the fields (not just object fields) are present at the right position in the view. This way b.view('O,p,O')
is disallowed since the p is missing in a
. (And if hasobject=False for both inputs, the view is allowed).
I think that change to this PR still allows most of the use-cases you've pointed out.
@eric-wieser Needs rebase. What do you want to do with this? |
return (base_offsets[:,None] + sub_offsets).ravel() | ||
|
||
# record type - combine the fields | ||
if dtype.fields: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for when i get back to this: Needs is not None
, and a test for np.dtype([])
Right now you can do this with primitive types:
This adds
Which would previously error
This makes it possible to use
np.unique
on 2d object arrays