Skip to content

BUG: Raise a quieter MaskedArrayFutureWarning for mask changes. #7194

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 9, 2016

Conversation

jakirkham
Copy link
Contributor

Fixes #7164
Related: #7020
Related: #5580

Drop the __getitem__ warning. In __setitem__, check to see if the masked array is shared. If it is shared, we know it will propagate upstream in the future. Also, use a more specific warning type instead of using FutureWarning so that this can be explicitly ignored.

@jakirkham jakirkham force-pushed the fix_warning_masked_array_slices branch 2 times, most recently from a41ab10 to aa81395 Compare February 5, 2016 16:03
@jakirkham jakirkham changed the title BUG: Raise FutureWarning for masked array when it can propagate. @jakirkham BUG: Raise a quieter MaskedArrayFutureWarning for mask changes. Feb 5, 2016
@jakirkham jakirkham changed the title @jakirkham BUG: Raise a quieter MaskedArrayFutureWarning for mask changes. BUG: Raise a quieter MaskedArrayFutureWarning for mask changes. Feb 5, 2016
@jakirkham jakirkham force-pushed the fix_warning_masked_array_slices branch from aa81395 to a262155 Compare February 5, 2016 16:04
@jakirkham
Copy link
Contributor Author

This seems a bit too quiet to me, but I am not sure there is a good fix. The problem that I see is that once some masked array that shares a mask gets assigned to it will copy the mask and so will no longer share it. So, future operations with this object will not raise a warning. However, I don't see a good way to fix this. See an example below.

import numpy as np

a = np.ma.masked_array(np.zeros((2,3)), mask=np.zeros((2,3)))
b = a[0:1]

b[0] = np.ma.masked      # raises the warning (mask gets copied)
b[0,1] = np.ma.masked    # doesn't raise a warning

@jakirkham jakirkham force-pushed the fix_warning_masked_array_slices branch from a262155 to 2768476 Compare February 5, 2016 16:23
FutureWarning
)
if (self._mask is not nomask) and \
(self._sharedmask or self._data.flags.owndata):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By checking whether we own the data of the data part of the array, we can see if it is a view or not. If it is a view the mask should be a view in the future, so this is a good reason to check. This will only be worth checking if we don't already note that we share the mask.

@jakirkham
Copy link
Contributor Author

I think I fixed the former issue by checking to see if the data is a view or not.

@jakirkham jakirkham force-pushed the fix_warning_masked_array_slices branch from 2768476 to 8aea362 Compare February 5, 2016 16:47
@jakirkham
Copy link
Contributor Author

Weird. So, if I do the following, I think I am seeing a bug with ndarray or I am really not understanding how the owndata flag is working here.

import numpy as np

a = np.ma.masked_array(np.zeros((2,3)), mask=np.zeros((2,3)))
b = a[0:1]

b[0] = np.ma.masked
b[0,1] = np.ma.masked

c = b.copy()

assert c._data.flags.owndata        # Assertion fails.

@seberg
Copy link
Member

seberg commented Feb 5, 2016

Too busy right now, but do you mind having a look at my take on it as well, maybe you find some mistake. you do not expect a warning the second case, since copying the mask will set the sharedmask attribute to False?

@charris charris added this to the 1.11.0 release milestone Feb 5, 2016
@jakirkham
Copy link
Contributor Author

Oh, sorry, I copied and pasted my example code from before. I now work around this, but yes it will actually make the _sharedmask attribute False because a copy of the mask is made in the first case. That is why I check to see if _data has the owndata attribute set. However, that appears not to work as shown by this second example.

@seberg
Copy link
Member

seberg commented Feb 5, 2016

@jakirkham I really think we should put the warning to exactly where we plan to change things, which is as far as I can tell just the unshare_mask() call in __setitem__. Since I think I have tested things more carefully (i.e. you need to add an _sharedmask = False into corrcoef or something, to not get warnings which you have no influence over at all as user, I would ask you to check my stuff rather then spending too much time here. I would hate for both of us to waste even more time on it because we invent the same thing :(.

@jakirkham
Copy link
Contributor Author

My concern with adding the warning to unshare_mask is that while this does overlap with __setitem__, it is also a public API function that can be used of its own accord. Namely, it can be used to copy the mask when needed before some operation. So, this will raise some false positives that may be confusing to a user who is not using __setitem__ and is clearly not setting any values in the mask.

I'm not sure that I like recommending messing with an internal API detail like _sharedmask. It's not clear to me what negative effects one would get from this.

I felt like as I did introduce this problem and people do view it as one that I should make some effort to fix it. You're right that we shouldn't both be fixing. Currently, I am not working on this. It is merely a proposal for our discussion and definitely needs some more work as mentioned.

@jakirkham jakirkham changed the title BUG: Raise a quieter MaskedArrayFutureWarning for mask changes. WIP, BUG: Raise a quieter MaskedArrayFutureWarning for mask changes. Feb 5, 2016
FutureWarning
)
if (self._mask is not nomask) and \
(self._sharedmask or not self._data.flags.owndata):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While owndata seems like the right sort of check, it seems that views of the data are being regularly created as part of construction a MaskedArray. We need another way to get at this information. Maybe a reference count to the original array, but am not sure how to get that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to mention that if you mask an existing array the data is a view, so that won't work. Reference counting would be a source of endless troubles. Hmm...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, as views on the data act only as a proxy for what we really want to know. I agree reference counting is to messy.

One thought I had is if we have some flag set here in __setitem__ to note if the mask was shared previously like self._oldsharedmask. Admittedly, if other references to the old mask are dropped, there might still be a few false positives, but trying to address that special case involves referenced counting, which I think we both want to avoid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and implemented the solution described above.

@jakirkham jakirkham force-pushed the fix_warning_masked_array_slices branch from 8aea362 to 99342d5 Compare February 5, 2016 21:07
@charris
Copy link
Member

charris commented Feb 6, 2016

@jakirkham @seberg One thing worth thinking about is if there is a workaround that will result in the original behavior, i.e., data view and mask copy, that way people could maintain backward compatibility if they really wanted the old behavior.

@jakirkham
Copy link
Contributor Author

Simple call unshare_mask beforehand. My current proposals of warning and bug fix do not change its behavior.

@njsmith
Copy link
Member

njsmith commented Feb 6, 2016

unshare_mask gives users the ability to request the future behavior now,
but not to continue to get the old behavior in the future, right?

On Fri, Feb 5, 2016 at 5:13 PM, jakirkham notifications@github.com wrote:

Simple call unshare_mask beforehand. My current proposal of warning and
bug fix do not change its behavior.


Reply to this email directly or view it on GitHub
#7194 (comment).

Nathaniel J. Smith -- https://vorpus.org http://vorpus.org

@jakirkham
Copy link
Contributor Author

No, the other way around. It will make sure a copy of the mask is made and used as the new mask. So, it will stop propagation to the former mask's views as they won't share the same data any more.

@njsmith
Copy link
Member

njsmith commented Feb 6, 2016

Ah, okay, right, sorry, I had it mixed up.

@jakirkham
Copy link
Contributor Author

I should add that doing so with this change to the warning will silence it. Though any trivial masks like those in matplotlib will be silenced anyways.

@seberg
Copy link
Member

seberg commented Feb 6, 2016

The only way I can think of, is to do ma_arr._sharedmask = False
which will pretty much give you the new behaviour (and you may have to
do it to out arguments once in a while as well).
It is ugly as hell, but I put it into the my version of the warning.

" will propagate values back through all masks that are present.",
FutureWarning
)
if (self._mask is not nomask) and \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enclosing parenthesis are preferred to line continuations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This should be corrected.

@charris
Copy link
Member

charris commented Feb 9, 2016

So I probably don't quite understand all the nuances of what happens when a view is taken, which I presume is where all the trouble is, but if _oldsharedmask is set in __getitem__ then the check in __setitem__ would be something like

self._oldsharedmask = getattr(self, "_oldsharedmask", self._sharedmask)
if self._oldsharedmask != self._sharedmask:
   ...

Anyway, it does seem to me that if you know what the propective changes will be (and that should be layed out before this), then the best thing would be to issue the warning where it makes a difference.

@charris
Copy link
Member

charris commented Feb 9, 2016

And now I've gone back and reviewed the original PR, which I had forgotten about. The nomask case really does add a complication. In fact the whole thing is sufficiently complicated that I'm tempted to push this off until we do have a clear PR that has been discussed that implements the new behavior so that it can be discussed in detail.

@mhvk, @ahaldane Thoughts?

@jakirkham
Copy link
Contributor Author

It feels to me like this is being overthought. The reality is if a slice is taken and the mask is non-trivial _sharedmask gets set to True. Normally when __setitem__ is called, it takes a copy of the mask (if it is non-trivial) so it is no longer a view.

All we want to do is raise a warning in __setitem__ when the mask could have been a view before. This will be one of two cases.

  1. The mask is shared when entering __setitem__ (_sharedmask is True).
  2. At some point previously the mask was shared and never intentionally unshared (_oldsharedmask is True).

This is all we are doing with this warning. I am not sure how I can help make this any clearer. Please ask questions if you have them.

@jakirkham
Copy link
Contributor Author

The behavior is implemented in that PR and has been discussed. It also has been on the mailing list at least once (possibly twice IIRC). The __setitem__ tests all pass. I merely have one failure within one expression in a test using putmask due to a merge conflict that I need to look into, but if we can't make some traction on this warning I don't think it is time well spent for me to try to fix this.

@charris
Copy link
Member

charris commented Feb 9, 2016

@jakirkham Let me think on it for another night. I want to be clear about what is going on and I think you would agree that the masked array code takes a bit of work to understand.

@jakirkham
Copy link
Contributor Author

Ok, that's fair. I need to do some other things atm, but will check back in tomorrow. Please let me know if you have questions.

@seberg
Copy link
Member

seberg commented Feb 9, 2016

OK, now I get the oldsharedmask idea. I am not sure I feel it is very necessary. We mostly worry about propagating values back to the parent, this most likely is either bad already at the first warning, or never. It is true that continuing to warn might help finding errors in some cases, so it is a nice idea, but I am not quite convinced it is necessary or that helpful.

For simplicities sake, I still prefer to placing the warning exactly where the code change will occur in the future. Note that my "added complexity" with reference counting only silences some additional spurious warnings in cases that cannot be affected by the change (unless someone does weakref crazyness, or has hooks in the mask arrays destruction).

Probably everyone knows these things, but in case it might make it a bit more clear. The unshare mask logic, is a mechanism designed to prevent back-propagation to the parent in:

parent = some_masked_array
# a child might be either of:
child = MaskedArray(data, mask)  # "shares" the mask with `mask`.
child = parent[::2]  # "shares" the mask with parent._mask

child[::2] = 3  # will copy mask (and sharedmask to False)
# mask or parent._mask are unchanged since a copy was made

# It sometimes also prevents propagation from parent to child if parent is a child
# though I see this more as a side effect, we still warn about it of course:
child = parent[::2]
grandkid = child[::2]
child[::2] = 3  # will create mask (and sharedmask to False)
grandkid._mask  # will be unchanged.

# The oldsharedmask if I understand correctly will help in the sense that a second call to:
child[2] = 5 
# would warn again, would be good if someone missed there is a second setitem though for the
# most part it would seem that if they are OK to propagate the first time, they should be fine the
# second time as well.

Additionally there is the unshare_mask function which forces the mask unsharing manually (if sharedmask is true. I actually think I now agree that telling people to set _sharedmask = False might be a bad idea, since it should not be done if the array might be given to downstream (my guess is, we can just say that it is OKish for temporary arrays only, not for arrays someone other than yourself might see).

The proposed change, as far as I understand, is to simply never call unshare_mask during a __setitem__ operation, so that a copy of the mask (without also copying the data) will only occur if the user explicitly asks for it.
There is exactly one call to unshare_mask in numpy, and that is a single branch of __setitem__, so warning before this call occurs seems quite specific.

There are also many many places where sharedmask should be set to False, but is not, which is a different issue (and a few extra copies never killed anyone I guess).

charris added a commit that referenced this pull request Feb 9, 2016
BUG: Raise a quieter `MaskedArrayFutureWarning` for mask changes.
@charris charris merged commit 920c527 into numpy:master Feb 9, 2016
@charris
Copy link
Member

charris commented Feb 9, 2016

Well, let's give this a shot in 1.11.0b3.

@jakirkham
Copy link
Contributor Author

Thanks @charris.

@jakirkham
Copy link
Contributor Author

OK, now I get the oldsharedmask idea. I am not sure I feel it is very necessary. We mostly worry about propagating values back to the parent, this most likely is either bad already at the first warning, or never. It is true that continuing to warn might help finding errors in some cases, so it is a nice idea, but I am not quite convinced it is necessary or that helpful.

Glad you follow, @seberg. Sorry I didn't explain it better before. You may be right, but at this point I think any adjustment to this warning will try to follow the changes in the patch, but I will need to revisit the patch to make sure its merge conflicts have been resolved correctly.

For simplicities sake, I still prefer to placing the warning exactly where the code change will occur in the future.

Agreed. Will need to fix some merge conflicts with the old patch to make sure that we know exactly where they are and they are correct.

Probably everyone knows these things, but in case it might make it a bit more clear. The unshare mask logic, is a mechanism designed to prevent back-propagation to the parent in

Thanks for doing a better job explaining it than I did. :)

I actually think I now agree that telling people to set _sharedmask = False might be a bad idea, since it should not be done if the array might be given to downstream (my guess is, we can just say that it is OKish for temporary arrays only, not for arrays someone other than yourself might see).

Glad to hear you are coming around on this. I suppose we could suggest some kind of inheritance that fixes this or a monkey patch (only suggested as inheritance of NumPy arrays is a bit complex) to get this behavior earlier to avoid accidentally messing up _sharedmask.

The proposed change, as far as I understand, is to simply never call unshare_mask during a setitem operation, so that a copy of the mask (without also copying the data) will only occur if the user explicitly asks for it.

That's the idea and you are probably right we can just put the warning there. I think there was one tweak with fields of record arrays, but they should still hit this warning.

@jakirkham jakirkham deleted the fix_warning_masked_array_slices branch February 9, 2016 19:37
@seberg
Copy link
Member

seberg commented Feb 10, 2016

Just to note, I am very sure that there is at least one spurious
warning in np.ma.corrcoef that has to be disabled by setting
_sharedmask = False on a temporary array.

I have not tested this code when it comes to spurious errors in the
test suit (I had with my stuff), so it probably is fine, but also the
nanfuncs should definitely not give warnings.

@jakirkham
Copy link
Contributor Author

I doubt we will ever get perfect precision and recall here. Though I will contend that the changes should help make a noticeable improvement on the situation.

@seberg
Copy link
Member

seberg commented Feb 10, 2016

Sure, but we should (really must) avoid all warnings we know of and
that are just scaring users for no reason :)

@jakirkham
Copy link
Contributor Author

Once confirmed, we can look at any and all special cases that appear.

I think the better thing to do though is to make sure the bugfix patch has its merge conflicts fixed and tests passing. After this is done, the warning can probably be cut down a bit more by just emitting warnings where the patch changes behavior.

@ahaldane
Copy link
Member

Just to note in case we want to make it even quieter after this, in #7020 I suggested to only raise the warning in __setitem__ if the maskedarray is a view (detected by data.owndata):

if (self._mask is not nomask) and not self.data.owndata: 
    warnings.warn('...')

this way the warning only shows up in code which attempts to assign to a view of a masked array. I'm not sure exactly how much that solution overlaps with the one here involving oldsharedmask.

@charris
Copy link
Member

charris commented Feb 11, 2016

@ahaldane The problem is that owndata cannot be used to detect a view.

In [1]: ma.array([1,2,3]).flags
Out[1]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

@jakirkham
Copy link
Contributor Author

FWIW, @ahaldane, that is what I tried first, but it turns out that all masked arrays that create a view of an ndarray or ndarray subclass. So, they never own their data even though nothing else does either. If you say this is unintuitive and needs to be fixed, I would completely agree. Though I don't plan to tackle that here and changing it without breaking something may be a bit trickier than one might initially think.

@njsmith
Copy link
Member

njsmith commented Feb 11, 2016

@jakirkham: OWNDATA is just not a reliable sign of whether there are multiple user-exposed views of the same array -- masked arrays are far from the only place that acts like this. I wouldn't bother trying to fix it; of course it's best to have a signal that's 100% reliable, but in some ways it's worse to have a signal that's 99% reliable than a signal that's 90% reliable, because with the 99% reliable signal people are more likely to rely on it.

@jakirkham
Copy link
Contributor Author

Sorry if that was confusing, I am merely agreeing with @charris that using OWNDATA will not work here. When I initially wrote the warning I tried it, but then I remembered this is completely unreliable for masked arrays. Though I was not aware that OWNDATA was unreliable elsewhere.

As far as thinking about improving the warning, I was only considering this if it was found to be causing issues somewhere else. Thus far, (and this may be premature), this does not seem to be the case. If that changes, then I am happy to rework the warning patch.

@ahaldane
Copy link
Member

Hmm, fair enough, I didn't realize that.

@seberg
Copy link
Member

seberg commented Feb 11, 2016

I would have thought you use owndata on the mask only. Which I am not
certain works, because I think the mask is sometimes even just a
reference (another bug in my opinion, it should always be at least a
view). Which is why I added the refcounting additionally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants