Skip to content

[WIP] Masking invalid x and/or weights in hist (#6483) #7133

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
61 changes: 38 additions & 23 deletions lib/matplotlib/axes/_axes.py
Original file line number Diff line number Diff line change
Expand Up @@ -5871,7 +5871,7 @@ def hist(self, x, bins=None, range=None, normed=False, weights=None,
Parameters
----------
x : (n,) array or sequence of (n,) arrays
Input values, this takes either a single array or a sequency of
Input values, this takes either a single array or a sequence of
arrays which are not required to be of the same length

bins : integer or array_like or 'auto', optional
Expand Down Expand Up @@ -6042,8 +6042,8 @@ def hist(self, x, bins=None, range=None, normed=False, weights=None,

"""
def _normalize_input(inp, ename='input'):
"""Normalize 1 or 2d input into list of np.ndarray or
a single 2D np.ndarray.
"""Normalize 1 or 2d input into a list of one or more np.ndarrays
which have potentially different lengths.

Parameters
----------
Expand All @@ -6054,25 +6054,30 @@ def _normalize_input(inp, ename='input'):
"""
if (isinstance(x, np.ndarray) or
not iterable(cbook.safe_first_element(inp))):
# TODO: support masked arrays;
inp = np.asarray(inp)

inp = cbook.safe_masked_invalid(inp)
if inp.ndim == 2:
# 2-D input with columns as datasets; switch to rows
inp = inp.T

if inp.shape[1] < inp.shape[0]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am 👎 on this warning, I think it is assuming too much about how users will be using this method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now see that this is just moving this warning around. I would be in favor of removing it, but if it is already in the code base, mostly ignore my previous comment.

Copy link
Contributor Author

@TrishGillett TrishGillett Sep 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually just migrated up from where it was before (old line 6068). I don't care for it either but left it alone thinking it probably got debated when it was added. I'm more than happy to remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. I'm good either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absent a compelling reason I err on the side of not removing things (which may not be a fully defend able position, but it is a safe conservative one)

Chasing this back through git blame, the warning came into the code base in
214976f in 2008.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also in favor of removing it (I am biased as I am in a field where nsamples << nvariables )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, wow, that I can see being a problem.
Maybe throw an error (or raise a warning) if x.shape[0] == 1?

warnings.warn(
'2D hist input should be nsamples x nvariables;\n '
'this looks transposed '
'(shape is %d x %d)' % inp.shape[::-1])

# Change to a list of arrays
inp = [inp[i, :] for i in range(inp.shape[0])]

elif inp.ndim == 1:
# new view, single row
inp = inp.reshape(1, inp.shape[0])
# Change to a list with a single array
inp = [inp.reshape(1, inp.shape[0])]
else:
raise ValueError(
"{ename} must be 1D or 2D".format(ename=ename))
if inp.shape[1] < inp.shape[0]:
warnings.warn(
'2D hist input should be nsamples x nvariables;\n '
'this looks transposed '
'(shape is %d x %d)' % inp.shape[::-1])
else:
# multiple hist with data of different length
inp = [np.asarray(xi) for xi in inp]
# Change to a list of arrays
inp = [cbook.safe_masked_invalid(arr) for arr in inp]

return inp

Expand Down Expand Up @@ -6117,23 +6122,23 @@ def _normalize_input(inp, ename='input'):
binsgiven = (cbook.iterable(bins) or bin_range is not None)

# basic input validation
flat = np.ravel(x)

input_empty = len(flat) == 0
input_empty = len(np.ravel(x)) == 0

# Massage 'x' for processing.
# Massage shape of 'x' to be a list of arrays
if input_empty:
x = np.array([[]])
x = [np.array([[]])]
else:
x = _normalize_input(x, 'x')
nx = len(x) # number of datasets

# We need to do to 'weights' what was done to 'x'
if weights is not None:
w = _normalize_input(weights, 'weights')
else:
# Massage shape of 'weights' to be a list, where each element
# weights[i] is either None or an array with the same shape as x[i]
if weights is None:
w = [None]*nx
else:
w = _normalize_input(weights, 'weights')

# Comparing shape of weights vs. x
if len(w) != nx:
raise ValueError('weights should have the same shape as x')

Expand All @@ -6142,6 +6147,16 @@ def _normalize_input(inp, ename='input'):
raise ValueError(
'weights should have the same shape as x')

# Combine the masks from x[i] and w[i] (if applicable) into a single
# mask and apply it to both.
if not input_empty:
for i in range(len(x)):
mask_i = x[i].mask
if w[i] is not None:
mask_i = mask_i | w[i].mask
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would raise an error if the weight matrix contained mask elements that weren't masked in the data x itself. Unless I am not seeing a use case for that?

Copy link
Contributor Author

@TrishGillett TrishGillett Sep 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question of whether to mask invalid weights was raised on #6483 but not settled. I don't know of a use case offhand and am okay with throwing an error instead if the feeling is that this is overbuilding. @efiring Do you think there could be a use case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using histograms with weights is reasonable way to do spatial integration of images, ex histogram(convert_pixel_locations_to_1D, bins, weights=image.ravel()). If you are looping over a bunch of images, some of which have artifacts that you need to mask out, being able to do this loop with out having to re-mask the pixel position values would be nice.

That said, this is something that should be pushed down into numpy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that explains the mask on x, not on the weights. I am also fine with a mask on the weights as long as it is identical as the one on x. But masks different on the weights and on x seem like it may be an error on the users side, hence throwing a ValueError.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the other way around

xx = convent_to_1D(imgs.shape[:-1])
for im in imgs:
    m = compute_mask(im)
    w = ma.masked_array(im, m)
    ax.hist(m, bins, weight=w)

Basically, if we are going to allow masking on either, masking on both and taking the union seems like the best course.

Copy link
Member

@efiring efiring Sep 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree; combining independent masks on both via a union is not difficult, it is easy to explain, and it handles every situation that might arise.
See cbook.delete_masked_points; you could use this as soon as you have the list of sequences of values and the matching list of sequences of weights or of Nones. You would apply it to each matching sequence of values and its corresponding sequence of weights, or None. Maybe it's overkill, but it was designed for this sort of thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiring Thanks for the cbook tip again, between that and a realization that masks seem to stack, I believe I can rewrite this piece as:

if not input_empty:
    for i in range(len(w)):
        if w[i] is not None:
            x[i] = np.ma.masked_array(x[i], mask=w[i].mask)
            w[i] = np.ma.masked_array(w[i], mask=x[i].mask)
    x = cbook.delete_masked_points(x)    
    w = cbook.delete_masked_points(w)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite what I had in mind...

xclean, wclean = [], []
for xi, wi in zip(x, w):
    xiclean, wiclean = cbook.delete_masked_points(xi, wi)
    xclean.append(xiclean)
    wclean.append(wiclean)

This works regardless of whether wi is None or a sequence the length of xi. Using cbook.delete_masked_points is only worthwhile if you are giving it more than one argument. Deleting the masked points from a single 1-D masked array can be done simply by calling it's compressed() method, so that's all you need in your version above. You would call w[i] = w[i].compressed() inside the inner if, and x[i] = x[i].compressed() immediately after it. There are many variations on the theme; I'm not sure which is best.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, that's pretty nice, I didn't read the docs carefully enough to see the part about any points masked in one arg getting masked and deleted in the others. I'm not sure I understand what 'All input arguments that are not passed unchanged' means though...

w[i] = np.ma.masked_array(w[i], mask=mask_i).compressed()
x[i] = np.ma.masked_array(x[i], mask=mask_i).compressed()

if color is None:
color = [self._get_lines.get_next_color() for i in xrange(nx)]
else:
Expand Down