Skip to content

ENH: Let MaskedArray getter, setter respect baseclass overrides #4586

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 4, 2015
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions doc/release/1.10.0-notes.rst
Original file line number Diff line number Diff line change
Expand Up @@ -207,6 +207,12 @@ arguments for controlling backward compatibility of pickled Python
objects. This enables Numpy on Python 3 to load npy files containing
object arrays that were generated on Python 2.

MaskedArray support for more complicated base classes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Built-in assumptions that the baseclass behaved like a plain array are being
removed. In particalur, setting and getting elements and ranges will respect
baseclass overrides of ``__setitem__`` and ``__getitem__``.

Changes
=======

Expand Down
27 changes: 14 additions & 13 deletions numpy/ma/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -3043,8 +3043,7 @@ def __getitem__(self, indx):
# if getmask(indx) is not nomask:
# msg = "Masked arrays must be filled before they can be used as indices!"
# raise IndexError(msg)
_data = ndarray.view(self, ndarray)
dout = ndarray.__getitem__(_data, indx)
dout = self.data[indx]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking this should be

dout = self.view(type=ndarray)[indx]

because it is possible that the _baseclass overrides the indexing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you logic (especially after the troubles with masked matrices for .flat...; indeed, cc @seberg) but one of the main points of this PR is actually that, to allow the baseclass to have its say in getting and setting. For astropy's Quantity, getting a proper slice would work fine, since that gets turned back into a MaskedArray and is updated (which is important as it copies the Quantity's unit). But if one gets a single item, ndarray's __getitem__ leaves one with an np.float64 object, which, if not masked, will be returned as is, i.e., one has lost the unit (or, more generally, removed the baseclass aspect of the item).

Indeed, the above is why we have a Quantity.__getitem__ : we return a single item as a a̶r̶r̶a̶y̶ ̶s̶c̶a̶l̶a̶r̶ 0-dimensional array instead (and have an internal property that remembers that this is a scalar).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what to think of this, except that it seems like this "masked matrix" stuff was always broken with indexing... It likely makes sense to actually change this, the finalization stage should attempt to fix the mask shape (needs test with matrix). But shape fixing in the data array and not having the same thing in the mask array, still seems like asking for trouble to me in any case (i.e. the whole reason why the .flat was needed)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem to make sense to allow the _baseclass to override the indexing. I did check scipy with your changes and didn't see anything the looked related to this. Hmm... could you add a note here and we'll give this a try.

Looks like the code in MaskedArray could use a cleanup, if only for clarity.

Curious that _data and data are the same thing, likewise _mask and mask. I wonder why that was done? Backward compatibility? Intent to deprecate one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Masked array could probably use a rewrite using __numpy_ufunc__ anyway when it is time...

Isn't data a property and _data the real array? Though I think you are probably right and it is identical. If it doesn't, it would be better if .data would return a view on _data and not the object itself.

About the matrix stuff/subclasses that change the shape... This makes enough trouble in numpy, getting it to work right in ma is probably hell and buggy at best at this time (just my guess on things, I never even used masked arrays ^^). For what I care, I would be fine with deprecating all that .flat usage. Do we really need to support subclasses that mess with the shape inside masked arrays?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@charris - It is indeed odd that _data and data are the same thing. For _mask and mask it looks more like the usual private property shielded by a getter and setter.
@seberg - definitely would be great to move to __numpy_ufunc__... but it does need someone to sit down and be careful...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Y'all, sorry for not being able to be more responsive on all these problems. The _data is indeed the private property accessing the underlying ndarray (just like _mask stores the boolean ndarray corresponding to, you guessed it). Neither _data or _mask are supposed to be accessible to the lay person, hence the data and mask properties. Well, hence the mask: the data was introduced to give a quick access to the ndarray in cases where you'd want to drop the MaskedArray part for a while, perform faster computations and pull the mask back at the end.
In the problem at hand, the objective is to get the underlying data in its 'least-adulterated' form: @charris is right, it should be `dout=self.view(...)[indx].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pierregm - the problem with doing dout=self.view(np.ndarray)[indx] is that, as mentioned above, this presumes that a single item from a subclass can be represented as a np.float64 scalar. This, however, is false at least for astropy's Quantity, where one does need to keep the unit.
Note that the problem is worse for __setitem__: the baseclass may not allow one to set parts to just anything, it may need to check that, e.g., the data put in has the same class (or, for Quantity, that the units are consistent, and convert if necessary).

# We could directly use ndarray.__getitem__ on self...
# But then we would have to modify __array_finalize__ to prevent the
# mask of being reshaped if it hasn't been set up properly yet...
Expand Down Expand Up @@ -3074,6 +3073,8 @@ def __getitem__(self, indx):
# Update the mask if needed
if _mask is not nomask:
dout._mask = _mask[indx]
# set shape to match that of data; this is needed for matrices
dout._mask.shape = dout.shape
dout._sharedmask = True
# Note: Don't try to check for m.any(), that'll take too long...
return dout
Expand All @@ -3091,16 +3092,16 @@ def __setitem__(self, indx, value):
# if getmask(indx) is not nomask:
# msg = "Masked arrays must be filled before they can be used as indices!"
# raise IndexError(msg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this line has pretty much everything the bothers me ;) Note the here _data is viewed as _baseclass, which may, but I'm not sure, have a different shape than self._data in the matrix case or some other odd case. Same thing above, where ndarray.view(self, ndarray) may be different from self.data, the latter may actually be a matrix type. I'd also like to know what the difference between self.data and self._data is, likewise self.mask and self._mask.

Dealing with masked arrays is like trying to sneak up on a tiger while creeping across eggshells. Disaster threatens at every move.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@charris - yes, it is weird, but it may be that this predates the following lines

    def _get_data(self):
        """Return the current data, as a view of the original
        underlying data.

        """
        return ndarray.view(self, self._baseclass)
    _data = property(fget=_get_data)
    data = property(fget=_get_data)

Similarly,

    def _get_mask(self):
        """Return the current mask.

        """
        # We could try to force a reshape, but that wouldn't work in some cases.
#        return self._mask.reshape(self.shape)
        return self._mask
    mask = property(fget=_get_mask, fset=__setmask__, doc="Mask")

So, this would seem to be safe... Or is there a MaskedMatrix class or so where parts get overwritten? My larger worry is that this was all written at a time when people were not as used to writing lots and lots of test cases...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out the code actually postdates those functions you quote by about two years. Two different authors though, so we may not be the first struggling with the code ;) The mask part does look safe, although I'd like to know what the comment on line 2006 means.

I don't feel very confidence reviewing masked array code, there are a lot of subtleties that I don't understand yet...

_data = ndarray.view(self, ndarray.__getattribute__(self, '_baseclass'))
_mask = ndarray.__getattribute__(self, '_mask')
_data = self._data
_mask = self._mask
if isinstance(indx, basestring):
ndarray.__setitem__(_data, indx, value)
_data[indx] = value
if _mask is nomask:
self._mask = _mask = make_mask_none(self.shape, self.dtype)
_mask[indx] = getmask(value)
return
#........................................
_dtype = ndarray.__getattribute__(_data, 'dtype')
_dtype = _data.dtype
nbfields = len(_dtype.names or ())
#........................................
if value is masked:
Expand All @@ -3124,21 +3125,21 @@ def __setitem__(self, indx, value):
mval = tuple([False] * nbfields)
if _mask is nomask:
# Set the data, then the mask
ndarray.__setitem__(_data, indx, dval)
_data[indx] = dval
if mval is not nomask:
_mask = self._mask = make_mask_none(self.shape, _dtype)
ndarray.__setitem__(_mask, indx, mval)
_mask[indx] = mval
elif not self._hardmask:
# Unshare the mask if necessary to avoid propagation
if not self._isfield:
self.unshare_mask()
_mask = ndarray.__getattribute__(self, '_mask')
_mask = self._mask
# Set the data, then the mask
ndarray.__setitem__(_data, indx, dval)
ndarray.__setitem__(_mask, indx, mval)
_data[indx] = dval
_mask[indx] = mval
elif hasattr(indx, 'dtype') and (indx.dtype == MaskType):
indx = indx * umath.logical_not(_mask)
ndarray.__setitem__(_data, indx, dval)
_data[indx] = dval
else:
if nbfields:
err_msg = "Flexible 'hard' masks are not yet supported..."
Expand All @@ -3149,7 +3150,7 @@ def __setitem__(self, indx, value):
np.copyto(dindx, dval, where=~mindx)
elif mindx is nomask:
dindx = dval
ndarray.__setitem__(_data, indx, dindx)
_data[indx] = dindx
_mask[indx] = mindx
return

Expand Down
43 changes: 43 additions & 0 deletions numpy/ma/tests/test_core.py
Original file line number Diff line number Diff line change
Expand Up @@ -275,6 +275,49 @@ def test_indexing(self):
assert_equal(s1, s2)
assert_(x1[1:1].shape == (0,))

def test_matrix_indexing(self):
# Tests conversions and indexing
x1 = np.matrix([[1, 2, 3], [4, 3, 2]])
x2 = array(x1, mask=[[1, 0, 0], [0, 1, 0]])
x3 = array(x1, mask=[[0, 1, 0], [1, 0, 0]])
x4 = array(x1)
# test conversion to strings
junk, garbage = str(x2), repr(x2)
# assert_equal(np.sort(x1), sort(x2, endwith=False))
# tests of indexing
assert_(type(x2[1, 0]) is type(x1[1, 0]))
assert_(x1[1, 0] == x2[1, 0])
assert_(x2[1, 1] is masked)
assert_equal(x1[0, 2], x2[0, 2])
assert_equal(x1[0, 1:], x2[0, 1:])
assert_equal(x1[:, 2], x2[:, 2])
assert_equal(x1[:], x2[:])
assert_equal(x1[1:], x3[1:])
x1[0, 2] = 9
x2[0, 2] = 9
assert_equal(x1, x2)
x1[0, 1:] = 99
x2[0, 1:] = 99
assert_equal(x1, x2)
x2[0, 1] = masked
assert_equal(x1, x2)
x2[0, 1:] = masked
assert_equal(x1, x2)
x2[0, :] = x1[0, :]
x2[0, 1] = masked
assert_(allequal(getmask(x2), np.array([[0, 1, 0], [0, 1, 0]])))
x3[1, :] = masked_array([1, 2, 3], [1, 1, 0])
assert_(allequal(getmask(x3)[1], array([1, 1, 0])))
assert_(allequal(getmask(x3[1]), array([1, 1, 0])))
x4[1, :] = masked_array([1, 2, 3], [1, 1, 0])
assert_(allequal(getmask(x4[1]), array([1, 1, 0])))
assert_(allequal(x4[1], array([1, 2, 3])))
x1 = np.matrix(np.arange(5) * 1.0)
x2 = masked_values(x1, 3.0)
assert_equal(x1, x2)
assert_(allequal(array([0, 0, 0, 1, 0], MaskType), x2.mask))
assert_equal(3.0, x2.fill_value)

def test_copy(self):
# Tests of some subtle points of copying and sizing.
n = [0, 0, 1, 0, 0]
Expand Down
93 changes: 88 additions & 5 deletions numpy/ma/tests/test_subclassing.py
Original file line number Diff line number Diff line change
Expand Up @@ -84,20 +84,71 @@ def _get_series(self):

# also a subclass that overrides __str__, __repr__ and __setitem__, disallowing
# setting to non-class values (and thus np.ma.core.masked_print_option)
class CSAIterator(object):
"""
Flat iterator object that uses its own setter/getter
(works around ndarray.flat not propagating subclass setters/getters
see https://github.com/numpy/numpy/issues/4564)
roughly following MaskedIterator
"""
def __init__(self, a):
self._original = a
self._dataiter = a.view(np.ndarray).flat

def __iter__(self):
return self

def __getitem__(self, indx):
out = self._dataiter.__getitem__(indx)
if not isinstance(out, np.ndarray):
out = out.__array__()
out = out.view(type(self._original))
return out

def __setitem__(self, index, value):
self._dataiter[index] = self._original._validate_input(value)

def __next__(self):
return next(self._dataiter).__array__().view(type(self._original))

next = __next__


class ComplicatedSubArray(SubArray):

def __str__(self):
return 'myprefix {0} mypostfix'.format(
super(ComplicatedSubArray, self).__str__())
return 'myprefix {0} mypostfix'.format(self.view(SubArray))

def __repr__(self):
# Return a repr that does not start with 'name('
return '<{0} {1}>'.format(self.__class__.__name__, self)

def __setitem__(self, item, value):
# this ensures direct assignment to masked_print_option will fail
def _validate_input(self, value):
if not isinstance(value, ComplicatedSubArray):
raise ValueError("Can only set to MySubArray values")
super(ComplicatedSubArray, self).__setitem__(item, value)
return value

def __setitem__(self, item, value):
# validation ensures direct assignment with ndarray or
# masked_print_option will fail
super(ComplicatedSubArray, self).__setitem__(
item, self._validate_input(value))

def __getitem__(self, item):
# ensure getter returns our own class also for scalars
value = super(ComplicatedSubArray, self).__getitem__(item)
if not isinstance(value, np.ndarray): # scalar
value = value.__array__().view(ComplicatedSubArray)
return value

@property
def flat(self):
return CSAIterator(self)

@flat.setter
def flat(self, value):
y = self.ravel()
y[:] = value


class TestSubclassing(TestCase):
Expand Down Expand Up @@ -205,6 +256,38 @@ def test_subclasspreservation(self):
assert_equal(mxsub.info, xsub.info)
assert_equal(mxsub._mask, m)

def test_subclass_items(self):
"""test that getter and setter go via baseclass"""
x = np.arange(5)
xcsub = ComplicatedSubArray(x)
mxcsub = masked_array(xcsub, mask=[True, False, True, False, False])
# getter should return a ComplicatedSubArray, even for single item
# first check we wrote ComplicatedSubArray correctly
self.assertTrue(isinstance(xcsub[1], ComplicatedSubArray))
self.assertTrue(isinstance(xcsub[1:4], ComplicatedSubArray))
# now that it propagates inside the MaskedArray
self.assertTrue(isinstance(mxcsub[1], ComplicatedSubArray))
self.assertTrue(mxcsub[0] is masked)
self.assertTrue(isinstance(mxcsub[1:4].data, ComplicatedSubArray))
# also for flattened version (which goes via MaskedIterator)
self.assertTrue(isinstance(mxcsub.flat[1].data, ComplicatedSubArray))
self.assertTrue(mxcsub[0] is masked)
self.assertTrue(isinstance(mxcsub.flat[1:4].base, ComplicatedSubArray))

# setter should only work with ComplicatedSubArray input
# first check we wrote ComplicatedSubArray correctly
assert_raises(ValueError, xcsub.__setitem__, 1, x[4])
# now that it propagates inside the MaskedArray
assert_raises(ValueError, mxcsub.__setitem__, 1, x[4])
assert_raises(ValueError, mxcsub.__setitem__, slice(1, 4), x[1:4])
mxcsub[1] = xcsub[4]
mxcsub[1:4] = xcsub[1:4]
# also for flattened version (which goes via MaskedIterator)
assert_raises(ValueError, mxcsub.flat.__setitem__, 1, x[4])
assert_raises(ValueError, mxcsub.flat.__setitem__, slice(1, 4), x[1:4])
mxcsub.flat[1] = xcsub[4]
mxcsub.flat[1:4] = xcsub[1:4]

def test_subclass_repr(self):
"""test that repr uses the name of the subclass
and 'array' for np.ndarray"""
Expand Down