Skip to content

ENH: Let MaskedArray getter, setter respect baseclass overrides #4586

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 4, 2015

Conversation

mhvk
Copy link
Contributor

@mhvk mhvk commented Apr 4, 2014

As is, MaskedArray uses explicity calls to ndarray.__setitem__(self.data, indx, value) and ndarray.__getitem__(self.data, indx) to access elements of its baseclass. This PR uses self.data[indx] instead, so that possible overrides of __getitem__ and __setitem__ are respected.

This is particularly useful for astropy's Quantity ndarray subclass, since it should check that anything written to it has the correct unit, and associate units for anything gotten from it.

(Note: this includes #4585, as otherwise some of the tests would fail; I'll rebase once that is in.)

@charris
Copy link
Member

charris commented Apr 4, 2014

LGTM.

@charris
Copy link
Member

charris commented Apr 12, 2014

Needs updating for the recent commits. I'm not sure why the merge button is still green.

@mhvk
Copy link
Contributor Author

mhvk commented Apr 16, 2014

@charris - I rebased, and it passes the tests. OK to merge?

@charris
Copy link
Member

charris commented Apr 21, 2014

LGTM. I cannot figure out why ndarray.__getitem__ and friends were use instead of the usual indexing. Any ideas? The ma code probably dates back to the days of python 2.0, but I don't think that should make a difference.

@mhvk
Copy link
Contributor Author

mhvk commented Apr 21, 2014

No ideas, I'm afraid... But it seemed good to clean up, hence also #4617.

@@ -3050,16 +3049,16 @@ def __setitem__(self, indx, value):
# if getmask(indx) is not nomask:
# msg = "Masked arrays must be filled before they can be used as indices!"
# raise IndexError(msg)
_data = ndarray.view(self, ndarray.__getattribute__(self, '_baseclass'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this line has pretty much everything the bothers me ;) Note the here _data is viewed as _baseclass, which may, but I'm not sure, have a different shape than self._data in the matrix case or some other odd case. Same thing above, where ndarray.view(self, ndarray) may be different from self.data, the latter may actually be a matrix type. I'd also like to know what the difference between self.data and self._data is, likewise self.mask and self._mask.

Dealing with masked arrays is like trying to sneak up on a tiger while creeping across eggshells. Disaster threatens at every move.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@charris - yes, it is weird, but it may be that this predates the following lines

    def _get_data(self):
        """Return the current data, as a view of the original
        underlying data.

        """
        return ndarray.view(self, self._baseclass)
    _data = property(fget=_get_data)
    data = property(fget=_get_data)

Similarly,

    def _get_mask(self):
        """Return the current mask.

        """
        # We could try to force a reshape, but that wouldn't work in some cases.
#        return self._mask.reshape(self.shape)
        return self._mask
    mask = property(fget=_get_mask, fset=__setmask__, doc="Mask")

So, this would seem to be safe... Or is there a MaskedMatrix class or so where parts get overwritten? My larger worry is that this was all written at a time when people were not as used to writing lots and lots of test cases...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out the code actually postdates those functions you quote by about two years. Two different authors though, so we may not be the first struggling with the code ;) The mask part does look safe, although I'd like to know what the comment on line 2006 means.

I don't feel very confidence reviewing masked array code, there are a lot of subtleties that I don't understand yet...

@mhvk
Copy link
Contributor Author

mhvk commented Apr 21, 2014

I went ahead and copied test_indexing in test_core.py to a new test_matrix_indexing. These tests fail with the current master, but with the addition of a mask.shape = dout.shape, they pass with this PR. So,it looks like this might solve even the masked matrix indexing problem! (well, better first see what travis thinks of it...).

@charris
Copy link
Member

charris commented May 4, 2014

@mhvk I'm going to delay the rest of the masked array work until 1.9 is branched.

@mhvk
Copy link
Contributor Author

mhvk commented May 4, 2014

@charris - yes, that is better, certainly for this part.

@charris
Copy link
Member

charris commented Aug 24, 2014

@mhvk OK, I'm back and sinking in this quicksand ;) I would help if I had a better understanding of what quantities is doing. That is, what methods are overridden, etc. If I understand correctly, the main problem here is the scalar returned by indexing. This is likely to be a general property, although ufuncs behave correctly.

In [6]: class Foo(ndarray):
    def __new__(cls, *args, **kwargs):
        return array(*args, **kwargs).view(cls)
   ...:     

In [7]: a = Foo([1]*3)

In [8]: a[...]
Out[8]: Foo([1, 1, 1])

In [9]: a[:2]
Out[9]: Foo([1, 1])

In [10]: a[1]
Out[10]: 1

In [11]: a.sum()
Out[11]: Foo(3)

Maybe there is a deeper problem here that needs to be addressed.

@mhvk
Copy link
Contributor Author

mhvk commented Aug 26, 2014

@charris - there are problems with both __getitem__ and __setitem__ that this PR tries to address. The one with __setitem__ is most obvious: if an item is unsuitable to be stored in the baseclass (in our case, e.g., a quantity with the wrong unit), the baseclass should be allowed to raise an exception (currently, I cannot easily give an example with quantity, since it requires some of the other PRs, but see the second half of test_subclass_items).

For __getitem__, you are indeed right that in Quantity -- and in the test case in this PR -- we are working around what one might well consider a deeper problem, that indexing a single item does not preserve subclass. But even so, by analogy with __setitem__, it would seem reasonable to just use the getter of baseclass rather than that of ndarray. (I really have no idea why MaskedArray was written to explicitly use ndarray.__setitem__ and ndarray.__getitem__.)

From a broader perspective, it would seem that MaskedArray effectively is a container class. As such, it would seem to make sense to let it make minimal assumptions about what it contains.

@mhvk mhvk force-pushed the ma/subclass-item-setting-getting branch from 613e061 to eea2342 Compare August 26, 2014 20:16
@mhvk
Copy link
Contributor Author

mhvk commented Aug 26, 2014

p.s. I rebased, which meant moving the note about this change from 1.9 to 1.10.

@charris
Copy link
Member

charris commented Jan 25, 2015

@mhvk OK, I'm ready to walk into this swamp again ;) Could you rebase?

@mhvk mhvk force-pushed the ma/subclass-item-setting-getting branch from eea2342 to aca4d1d Compare January 25, 2015 21:05
@mhvk
Copy link
Contributor Author

mhvk commented Jan 25, 2015

@charris - OK, rebased. Haven't looked at this myself for the past year, but still think the essence here is logical: let the baseclass handle things were appropriate.

@ahaldane
Copy link
Member

@mhvk I noticed some similar problems when combining ma and matrices, maybe you have an idea. Consider

>>> ax = matrix(np.random.rand(3,3))
>>> mx = np.ma.array(ax, mask=(np.random.rand(3,3)) < 0.5))

If you do ax*ax it performs a dot product, but doing mx*mx semi-unexpectedly performs a simple multiply.

Now, much like the changes in this PR, we could try to fix it by making MaskedArray.__mul__ end up calling self.data.__mul__ rather than (as currently) np.multiply(self.data. But my question is, how would we know that the mask should also be updated according to a dot product rather than multiply?

Similarly, a subtype might make certain ndarray methods return the elements in a different order, or in a different shape. How would np.ma know how to update the mask in those cases? (This is already sort of a problem with matrices which sometimes add dimensions of size 1, which do not get added to the mask, as I noticed here).

These problems make it seem very tough to me to make masked arrays interoperate well with arbitrary ndarray subtypes.

@mhvk
Copy link
Contributor Author

mhvk commented Mar 28, 2015

@ahaldane - when I wrote this PR I was mercilessly unaware of the pitfalls of matrices! Your example indeed shows a big one: either the masked version operates differently from the unmasked one, or the mask gets out of whack.

I think there are two parts to a possible solution. First would be for MaskedArray not to define __mul__, etc., but purely work at the ufunc level (i.e., with __array_prepare__, __array_wrap__, or, better, __numpy_ufunc__). Second to ensure that it takes on the methods what is being masked, so that the methods of the enclosed class are available. Then, when one calls mx * mx, this leads to Matrix.__mul__, which calls np.dot, which will first be passed through MaskedArray.__numpy_ufunc__, and there the mask can be dealt with, with the proper ufunc in hand.

But I'm far from sure this would work for all cases; in end, my sense would be that in terms of broadcasting, etc., MaskedArray should simply be able to assume that a subclass behaves like an ndarray. But obviously it does not currently, so this would break backwards compatibility.

Now on your specific example, I should add that my impression was that the matrix class has become such a problem that most people simply do not use it. Indeed, in python3.5 a new operator @ and associated __matmul__ method will be specifically introduced to make matrix operations with regular arrays easier (see http://legacy.python.org/dev/peps/pep-0465/).

@ahaldane
Copy link
Member

@mhvk thanks for the helpful explanations. I want to better understand what the "end goal" might be for masked arrays, before I start making smaller changes.

In the case of dot, there are ways I see that almost work through varous combinations of views, but I still haven't found one that really works. The problem with your suggestion is that dot (and sum and others) are not ufuncs, and that matrix.__mul__ never goes through the ufunc machnery. A solution for dot is not going to involve ufuncs at all, I think. Also, even if matrix multiplication is fixed up in python3.5, there is still a similar problem with sum and others. Anyway, I'll keep thinking about it.

@charris charris added this to the 1.10.0 release milestone Apr 7, 2015
@charris
Copy link
Member

charris commented Apr 22, 2015

@mhvk Could you rebase this again. It's probably just the release notes causing problems.

@mhvk mhvk force-pushed the ma/subclass-item-setting-getting branch from aca4d1d to 7a84c56 Compare April 22, 2015 23:09
@mhvk
Copy link
Contributor Author

mhvk commented Apr 22, 2015

@charris - ok, rebased. As you guessed, it was just the release notes -- and travis still passes!

@charris
Copy link
Member

charris commented Jun 4, 2015

I've convinced myself that the gain is worth the possible pain, so in it goes.

charris added a commit that referenced this pull request Jun 4, 2015
ENH: Let MaskedArray getter, setter respect baseclass overrides
@charris charris merged commit 36b9404 into numpy:master Jun 4, 2015
@charris
Copy link
Member

charris commented Jun 4, 2015

Thanks @mhvk .

@mhvk mhvk deleted the ma/subclass-item-setting-getting branch June 4, 2015 00:54
mhvk added a commit to mhvk/numpy that referenced this pull request Jun 18, 2015
For a masked array holding objects that themselves are arrays, when selecting a
single item, it is treated as if it were a slice of the array and an attempt is
made to set its mask.  This was always a bug, but it become visible with a
recent change to `MaskedArray.__getitem__` (numpygh-4586) where it is attempted to
change the shape of the mask.  With this PR, this case gets special  treatment
in that the object is made into a Masked Array with a full set mask.
(A previous attempt to do the perhaps more logical think and just return
`masked` caused quite a few errors in astropy.io.votable; it seemed it was
not worth breaking backwards compatibility that much).

Test case that now works but used to fail:
```
mx1 = MaskedArray([1.], mask=[True])
mx2 = MaskedArray([1., 2.])
mx = MaskedArray([mx1, mx2], mask=[False, True])
mx[0]
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants