-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
BUG: Correctly distinguish between 0d arrays and scalars in MaskedArray.__getitem__ #8905
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
So it looks like this is fixing two things.
If it is possible and not too clumsy, it might be nice to separate out the object array case into its own special case if-statement (eg |
Replacing just The problem is that some array subclasses decide that indexing should never cause scalars to unwrap, such that both However, there is a test in
These were already here, and unfortunately, by design (this is documented behaviour). Independently, it happens that one of the heuristics for "is this a scalar" is "is the dtype of the result array different from the input", which can only happen for object arrays |
f1c55a1
to
d194239
Compare
numpy/ma/core.py
Outdated
expect_scalar = not isinstance(mout, np.ndarray) | ||
else: | ||
# much harder to tell if we should get a scalar here - not quite | ||
# correct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an ideal world, we would call getmaskarray()
, and do the same as the above code path. Unfortunately, this would add a hefty performance cost for the most common cases.
You're on to something here - there were two different checks for object arrays - which happened to be similar, but there for totally different reasons. The latest patch separates the heuristic from the behaviour, and refers to the PR that introduced the special case for object arrays |
dout = mvoid(dout, mask=mask, hardmask=self._hardmask) | ||
return mvoid(dout, mask=mout, hardmask=self._hardmask) | ||
|
||
# special case introduced in gh-5962 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the test failure can be avoided by having here
elif self.dtype.type is np.object_ and dout is not masked and isinstance(dout, np.ndarray):
(And, yes, having a different .data
property is more than a little annoying. But it probably was a logical choice for the original Column
class, and then what do you do for MaskedColumn
...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this is probably yet another argument to not have a special case for object arrays holding ndarray
and just return masked
...
Do note, that #5962 didn't quite introduce this -- it just ensured that the pre-existing behaviour continued...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't see where the MaskedConstant
is being __new__
'd from, causing the problem you described below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait, yes I do - this is #8679 rearing its head. np.ma.array(np.ma.masked, copy=False)
creates a new masked
constant!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, that closes the loop...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a commit to fix that - but I need your fix as well
@ahaldane: Hit it with another rewrite to try and separate more things. If we got rid of the |
return MaskedArray(dout, mask=True) | ||
else: | ||
if mout: | ||
return masked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR breaks a test in astropy in a very strange way, in that what is returned here fails a test where masked_column[0] is np.ma.masked
(even though both seem to be masked
). I'm utterly confused and this may just be a build issue, but I don't get the same problem with current master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you construct a minimal example, or at least point me to that test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eric-wieser - it is this test
A more minimal example:
from astropy.table import MaskedColumn
import numpy as np
ma = MaskedColumn([1], dtype='O')
ma.mask = True
ma[0]
# masked
ma[0] is np.ma.masked
# False
type(ma[0]) is type(np.ma.masked)
# True
Since we overwrite __getitem__
with a CPython shim (for speed with ndarray
, I also checked that:
np.ma.MaskedArray.__getitem__(ma, 0) is np.ma.masked
# False
But it has something to do with our subclass:
np.ma.MaskedArray.__getitem__(ma.view(np.ma.MaskedArray), 0) is np.ma.masked
# True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For completeness, I added just before this return in core.py
:
print(mout, dout, masked is np.ma.masked)
and the output always is True 1 True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably, np.ma.MaskedArray.__getitem__(ma, 0)
still looks like ma.masked
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dout is masked
, or dout
looks like masked
but upon closer inspection is not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would using _data
instead of data
work, or would that break other bits of MaskedColumn
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, using _data
would work -- that is not used in MaskedColumn
. But it needs some thought whether that is the right approach, i.e., to what extent one wants subclasses to be able to override how data
is interpreted. Though on the other hand arguably this is so integral to how MaskedArray
works that it would be reasonable to use a private property, which indicates more clearly that "if you override this, expect breakage".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way, I managed to reproduce this problem without touching the data attribute at all, so I'm inclined to leave it as is for now (see the added tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
The correct behavior here is to return a new MaskedArray, because `MaskedConstant` is pretending to be a scalar
Thanks @mhvk for spotting the problem. There's an unrelated bugfix commit in here (BUG: np.ma.array(masked) should not return a new MaskedConstant) now too |
numpy/ma/tests/test_core.py
Outdated
# All element masked | ||
self.assertTrue(isinstance(a[-2], MaskedArray)) | ||
assert_equal(type(a[-2]), mvoid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behaviour here hasn't changed, the test was just less strict than it should be
numpy/ma/core.py
Outdated
else: | ||
# we must never do .view(MaskedConstant), as that would create a new | ||
# instance of np.ma.masked, which would be confusing | ||
if isinstance(data, cls) and subok and not isinstance(data, MaskedConstant): | ||
_data = ndarray.view(_data, type(data)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get so confused about this: why exactly is a view taken here? For isinstance(data, cls)
to hold, data
must be a subclass of ndarray
as well, and if subok
was true, then the above assignment of _data = np.array(data, subok=sub)
would already have ensured that type(_data) is type(data)
(independently of the other arguments). Am I missing something? If so, do add that as a comment....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reason might be to ensure that a copy of the wrapper is made - as otherwise we'd modify properties of the input array. isinstance(data, cls)
is checking for a MaskedArray
subclass.
All I did here was swap the order of the if in order to make it easier to see what is happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think that makes sense. If you do have another round, do add a comment...
_mask = self._mask | ||
|
||
def _is_scalar(m): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, it is not all that helpful to have these as helper functions, and I wonder a bit about the additional hit in performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the other one has to be a helper function in order for the flow control to work. I made this one a helper function because we need it twice, and it makes the lower-down code generally nicer.
I could move these to be free functions, if you think they'd be better outside the method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm heavily biased by having used in-method functions only in cases where I want access to some of the local variables; as this is not the case here, moving it outside might be better indeed. But this is not a strong preference, so don't worry about -- my main preference would be not to have them at all, but I can see that the code flow would be a bit odd (probably would need to start by assigning scalar_expected = None
, and then at the end check whether it has change; not great either).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only concern with moving it outside, is that it becomes distant compared to the rest of the logic. On the other hand, there's a good chance there's a bug waiting in __setitem__
that will need the same code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in cases where I want access to some of the local variables
Note that the original implementation did exactly that, but I realized it was pretty trivial to convert them into parameters, as there were only two
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe best to leave as is -- this is substantially clearer than it was!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving it as is sounds good to me!
I think I've addressed your other concerns too - and I reckon we leave any kind of rollback of #5962 to another PR, since it's not needed to fix the bug.
numpy/ma/tests/test_core.py
Outdated
@@ -4376,16 +4376,27 @@ def test_getitem(self): | |||
[1, 0, 0, 0, 0, 0, 0, 0, 1, 0])), | |||
dtype=[('a', bool), ('b', bool)]) | |||
# No mask | |||
self.assertTrue(isinstance(a[1], MaskedArray)) | |||
assert_equal(type(a[1]), mvoid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps also check equality of _data
and _mask
as done for element 0
below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll wrap this up in a helper to avoid the code duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing so
This looks good, modulo the small comments (of which the in-method functions is just a preference). |
numpy/ma/core.py
Outdated
|
||
# special case introduced in gh-5962 | ||
elif (self.dtype.type is np.object_ and | ||
isinstance(dout, np.ndarray) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hate giving pep8 comments, but no need for the extra indent here (only for if
clauses is it needed, since if (
takes 4 characters...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in, you want columnar alignment here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, with self
and isinstance
aligned; that way it is clearest what the parentheses gather together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I wasn't quite sure whether alignment or indenting was right here. Does PEP8 actually specify to wrap if
differently to elif
?
Either way, fixed, since you seem to think this is wrong, and I knew that I was unsure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alignment is standard, with the problem of if (
being 4 characters very specifically discussed - https://www.python.org/dev/peps/pep-0008/#indentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, my real "knowledge" just comes from using emacs with flymake
-- see http://docs.astropy.org/en/latest/development/codeguide_emacs.html -- which handily marks things that don't follow style, or uninitialized/unused variables, etc.
I'm happy with this. @ahaldane - do you want to have a look as well? (I'll merge tomorrow unless I hear otherwise). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, though it's unfortunate we need to take so many special cases into account.
I'm happy you wrote so many tests - I think these will be very helpful to anyone who needs to modiify this code in the future.
I'll leave it open overnight. First one to get to it tomorrow can merge!
Note that the entire thing could become |
This fixes #8684.
There's still a can of worms waiting to be opened here regardingobject
arrays that themselves contain ndarrays, but they would fail before this patch tooEdit: This can of worms has been opened, and fixed at the expense of speed