Skip to content

BUG, DEP: Fix masked arrays to properly edit views. ( #5558 ) #5580

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 9, 2017

Conversation

jakirkham
Copy link
Contributor

Fixes #5558

A detailed description of the bug is available at issue #5558. In short, when changing a view of a masked array, the data is properly propagated to the original, but changes to the mask are not. This pull request attempts to address this by ensuring the mask will also be properly altered.

@charris
Copy link
Member

charris commented Feb 23, 2015

I think this is another one of those policy questions that need discussion on the list. It looks like the current behavior was intentional.

@jakirkham
Copy link
Contributor Author

Posted to the mailing list.

@charris charris added this to the 1.10.0 release milestone Apr 7, 2015
@jakirkham
Copy link
Contributor Author

@charris, how do we know if we have reached consensus? It seems like I haven't heard any disagreement about this fix. It just needs tests for another use case.

@@ -3109,8 +3109,6 @@ def __setitem__(self, indx, value):
_mask[indx] = tuple([True] * nbfields)
else:
_mask[indx] = True
if not self._isfield:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the default here is self._sharedmask = True.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea why self._isfield was important?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll start by explaining what is going on and then answer your questions if that is alright. Please forgive me if I go into too much detail and explain the obvious. Good questions by the way. Sorry if my answer gets too long.

This particular branch is dealing with the case of a structured array. Suppose for example we have a structured array created like so a = np.zeros((2,), dtype=[("c",int, 2), ("d", float, 3)]). We could construct a masked structured array like so b = np.ma.array(a, mask=np.ma.getmaskarray(a)). In this example, we have two fields c and d. If we index into one say c like so b["c"], we will get a normal masked array.

However, there is no way to know that this masked array came from a structured masked array; so, the _isfield member variable indicates this. One can verify this is the case by trying b._isfield and b["c"]._isfield.

Now, why was _isfield checked? My belief, as I can't speak for the author personally, is the author correctly assumed that if we are looking at a field of a structured array our mask is a view of the mask from the original structured array's mask and we want to ensure changes here are propagated back to the original and so this must remain shared if it already is (this indirectly answers your first question). However, if it is not shared, it will remain the case.

To directly answer your first question, we can't assume anything about self._sharedmask. Its value will be dependent on what .

The argument that I am making here is that this one can't assume this is not a view if it is not a field. In other words, we could always be working with a view of another masked array where we want changes propagated to an original and should not be forcing the user's hand here. Additionally, it would be inconsistent to "unshare" the mask as the data remains shared regardless.

@charris
Copy link
Member

charris commented Jun 12, 2015

I'd say the consensus (of one other ;) was that the behavior needed changing, but it worries me. Have you tested this with any other packages, matplotlib for instance?

Also needs a rebase.

@jakirkham
Copy link
Contributor Author

Thanks for reviewing this @charris. I'll block out some time this weekend to both respond to your questions/suggestions and test this a bit more rigorously.

@jakirkham
Copy link
Contributor Author

It's true, unfortunately, the conversation didn't attract as many people as either of us might have hoped.

I have added some responses to give you more insight to the changes that are proposed here. I have not had time to test this with other packages and will try to set aside some more time to do that. Are there any other packages that you recommend?

Agreed that this needs a rebase, but I will hold off until you have had some time to see/respond to my comments otherwise they get buried on GitHub.

self.assertTrue(y2._data.__array_interface__ == x1.__array_interface__)
#self.assertTrue( y2.mask is m)
self.assertTrue(y2._mask.__array_interface__ == m.__array_interface__)
#self.assertTrue( y2.mask is m3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space after the #. It isn't clear that this is a comment rather than a commented out statement, it would be better something like
# check that y2.mask is m3 in (whatever the case is)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if it is a commented out statement, it should be removed. Frankly, I don't understand what is going on here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I wasn't clear as to why these commented statements were kept either. I think I tested the original code with those comments removed, but it failed if I recall correctly. Sure, we can drop them completely.

@charris
Copy link
Member

charris commented Jun 18, 2015

@jakirkham Needs a rebase also.

@ahaldane
Copy link
Member

Well, I read through it twice now in the hope of better understanding maskedarray, but it's a bit unclear.

It does seem like all the unshared mask stuff should be removed - the code makes more sense without, and it fixes a bug. But then it is quite mysterious why that code was there in the first place.

@jakirkham
Copy link
Contributor Author

@charris Thanks for taking another look at this. I will make some changes and rebase this.

@ahaldane If you have any questions please ask.

@mhvk
Copy link
Contributor

mhvk commented Jun 19, 2015

Sorry to get to this very late. I liked the idea but wondered whether there really was nothing in the documentation. It turns out the current behaviour is explicitly mentioned [1]:

When accessing a slice, the output is a masked array whose data attribute is a view of the original data, and whose mask is either nomask (if there was no invalid entries in the original array) or a copy of the corresponding slice of the original mask. The copy is required to avoid propagation of any modification of the mask to the original.

Personally, I feel this somewhat contradicts the expectation that one gets a view of everything. But obviously, if we change it, the documentation should be changed as well, and should include a note on how one can get back the old behaviour (presumably, something like MaskedArray(ma.data[slice], mask=ma.mask[slice].copy()))

Note, though, that with the new behaviour things will not be consistent if the original masked array did not have a mask. For that case, its mask attribute will hold nomask, and this cannot change if I I take a slice of some part, and then mask some items in that slice view. It seems unpleasant to me that behaviour would be different depending on whether I have nomask or a mask full of False.

[1] http://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html#indexing-and-slicing

@jakirkham
Copy link
Contributor Author

It turns out the current behaviour is explicitly mentioned.

It is reassuring that the current behavior was intended as opposed to being a possible mistake even if it contradicts our intuition.

Personally, I feel this somewhat contradicts the expectation that one gets a view of everything.

Agreed.

But obviously, if we change it, the documentation should be changed as well, and should include a note on how one can get back the old behaviour (presumably, something like MaskedArray(ma.data[slice], mask=ma.mask[slice].copy()))

Sure, I'll make sure to add this to this PR. It is also worth noting that the copy keyword argument will copy both (if I made and error feel free to draw my attention).

Note, though, that with the new behaviour things will not be consistent if the original masked array did not have a mask....

Yeah, nomask makes these situations difficult. If we feel this is a problem (and I do agree with you), it probably should be another discussion and a separate PR.

@charris
Copy link
Member

charris commented Jun 25, 2015

I go back and forth on this. The change does make sense, but then the original behavior was intended, no doubt for some reason that escapes us. The nomask case is also bothersome. Would it be an imposition if I put this off until 1.11-devel?

@charris
Copy link
Member

charris commented Jun 27, 2015

OK, I'm going to push this off to 1.11.0, it is more of a change than I want to see in the 1.10 release at this point.

@charris charris modified the milestones: 1.11.0 release, 1.10.0 release Jun 27, 2015
@jakirkham jakirkham force-pushed the fix_masked_array_views branch from 4bda89f to c110ced Compare July 14, 2015 14:09
@jakirkham
Copy link
Contributor Author

Sure, I don't have lots of time at present to work on this anyways.

Do you think it would be worth getting rid of nomask in favor of just having an array full of False?

Also, rebased. :)

@jakirkham
Copy link
Contributor Author

@charris, do you think it would be a worthwhile to post something on mailing list about deprecating nomask in favor of an ndarray of type bool and the same size as the original data?

@ahaldane
Copy link
Member

If we were to rewrite MaskedArray, I would definitely want to get rid of both masked and nomask, and always store the full boolean array for the mask. It's my biggest complaint, they cause half the problems in maskedarray. But I can't see how to get around the big backward compatibility break!

@charris
Copy link
Member

charris commented Jul 23, 2015

Maybe we should do a new "MaskedContainer" and try to solve some of the problems with ma. If we get a version of __numpy_ufunc__ out we could also use that.

@charris
Copy link
Member

charris commented Apr 14, 2017

The deprecation went into 1.11, so I am going to push this off to 1.14, which should be a little less than two years.

@charris
Copy link
Member

charris commented Apr 14, 2017

Close and reopen to restart testss.

@charris charris closed this Apr 14, 2017
@charris charris reopened this Apr 14, 2017
@charris charris modified the milestones: 1.14.0 release, 1.13.0 release Apr 14, 2017
@jakirkham
Copy link
Contributor Author

So NumPy 1.13 has already been branched out, right? Can this be merged?

@charris
Copy link
Member

charris commented Jun 1, 2017

@mhvk Any final thoughts on this?

@charris
Copy link
Member

charris commented Jun 1, 2017

@jakirkham Needs a 1.14 release note.

@jakirkham
Copy link
Contributor Author

Please let me know what you would like to see for a release note.

@@ -379,8 +379,8 @@ is masked.
When accessing a slice, the output is a masked array whose
:attr:`~MaskedArray.data` attribute is a view of the original data, and whose
mask is either :attr:`nomask` (if there was no invalid entries in the original
array) or a copy of the corresponding slice of the original mask. The copy is
required to avoid propagation of any modification of the mask to the original.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should note the previous behavior, the new behavior, and the release in which things changed.

@charris
Copy link
Member

charris commented Jun 1, 2017

@jakirkham Take a look at the 1.13.0-notes.rst. In the Compatibility notes section there is a FutureWarning to changed behavior subsection that you can use as a template.

@charris
Copy link
Member

charris commented Jun 1, 2017

I confess that the incompatible nomask behavior bothers me quite a bit. Maybe we should seek to fix that problem first. If there is a fix.

@charris
Copy link
Member

charris commented Jun 1, 2017

I'm thinking of something like

In [3]: class Doh(object):
   ...:     @property
   ...:     def mask(self):
   ...:         return nomask
   ...:     

In [4]: d = Doh()

In [5]: d.mask is nomask
Out[5]: True

to fix the nomask problem. The mask and hard/soft could be stored in a container that is inherited by the view, so can be changed, and the mask property accesses the container.

@eric-wieser
Copy link
Member

eric-wieser commented Jun 1, 2017

Do you mean something like:

class MaskedArray:
    @property
    def mask(self):
        if self._nomaskscalar_backcompat and not self._mask.any():
            return nomask
        return self._mask

Where _nomaskscalar_backcompat would only be set in places that would return nomask in 1.13 (some things decide that it's not worth checking, and return the full array, I think)

@charris
Copy link
Member

charris commented Jun 1, 2017

I was thinking more like storing the _mask and _hardmask in a dictionary that gets soft copied to the view. Hmm, I note that there are already (private) getters and setters for mask. There is also a view method that currently copies the mask, could add a copymask keyword to that. Just tossing ideas around here.

@charris
Copy link
Member

charris commented Jun 1, 2017

Also wondering if we need some locking around view taking if the mask is also a view. The thought is that another view could modify the mask while a new view is being taken. I suppose that this is a problem with views in general, and something that folks using them in threading environments need to take into account in any case. I wonder just how atomic ndarray indexing is?

@mhvk
Copy link
Contributor

mhvk commented Jun 1, 2017

Overall, I would like to go on, now that we've had the Future/Deprecation cycle, because I do prefer that masks are viewed just like the data.

But I also agree that we should not be in a state where the mask sometimes propagates and sometimes not... I quite like the wrapping in a new class, though also @eric-wieser's suggestion of faking the nomask attribute -- in the end, I think we may well be better off just getting rid of nomask entirely, and it would seem easier to deprecate nomask with that suggestion.

@charris
Copy link
Member

charris commented Jun 1, 2017

Given the widespread use of both mask and _mask, maybe the best way forward is to start by not using nomask internally, just make mask always an array. Then something like what Eric proposed to return nomask. It would also be good to deprecate the internals *._data and *._mask for all except ma itself. I think that would require changing the names, say to *._data__ and *._mask__. Not sure there is any reason so keep either, but not letting them get used in the wild would be good. Note that currently both are widely used outside of numpy.

Hmm, maybe we should start with a PR that changes masked arrays to always use arrays internally and see if that causes problems.

@eric-wieser
Copy link
Member

eric-wieser commented Jun 2, 2017

I'm kind of holding the opinion that maybe a complete rewrite of ma using __array_ufunc__ and views would be the best approach to compatibility - the only problem being that we'd need a new name that isn't np.ma

@jakirkham
Copy link
Contributor Author

So if you do go the rewrite np.ma route, I'd suggest actually making it a separate library from NumPy proper. The reason being that it takes a bit of work to get masked arrays right (as evidenced by all the np.ma problems). Having it as a separate library would allow it to evolve at a quicker pace and make it easier to engage community members who have thought about this problem a lot in other libraries. In the interim np.ma could be deprecated. Once this other library is stable, it could take the place of np.ma outright.

@charris
Copy link
Member

charris commented Jun 9, 2017

OK, I'll put this in as is. I expect the nomask problem can be fixed in another PR is so desired.

@charris charris merged commit fa913a8 into numpy:master Jun 9, 2017
@jakirkham jakirkham deleted the fix_masked_array_views branch June 9, 2017 18:38
@jakirkham
Copy link
Contributor Author

Thanks everyone. Glad to see this one in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Masked arrays don't properly edit views
8 participants