Skip to content

[WIP] Masking invalid x and/or weights in hist (#6483) #7133

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

TrishGillett
Copy link
Contributor

@TrishGillett TrishGillett commented Sep 18, 2016

I'm having trouble testing locally, so this is a WIP PR so I can see how this code does in the CI.

  • Figure out if the AppVeyor error is a problem or a false alarm
  • Make an image comparison test to make sure the masking is working
  • Clean up the commit history

See #6483 and #6992.

@NelleV
Copy link
Member

NelleV commented Sep 18, 2016

Hi @TrishGillett
Thanks for your patch!
I think I know why you are having problem running the tests locally… Do you have a traceback available so that I can confirm?

@TrishGillett
Copy link
Contributor Author

TrishGillett commented Sep 18, 2016

Hi @NelleV, unfortunately there's no traceback to show, it's a matter of struggling to get the right environment setup. I wanted to manually try out some things, but I was having trouble making a conda environment which:
a) would install matplotlib from my local branch, and
b) would let me actually plot stuff.

Right when I got the former figured out, I hit trouble with the latter: I couldn't open a jupyter notebook from inside the env, and was having backend troubles trying to create plots from a console. If you have any advice, I'd be really grateful. For now I'm going to look at the Travis results and see what bugs it caught. :)

@tacaswell
Copy link
Member

My tactic for this is something along the lines of

conda create -n mpl_dev python=3.5 ipython jupyter matplotlib
source activate mpl_dev
conda remove matplotlib
cd path/to/mpl_source
pip install -v -e .

@TrishGillett
Copy link
Contributor Author

TrishGillett commented Sep 18, 2016

Works like a charm, @tacaswell! Thanks so much!

For this PR, here are some tasks that still need to be done:

  • Figure out if the AppVeyor error is a problem or a false alarm
  • Make an image comparison test to make sure the masking is working
  • Clean up the commit history

for i in range(len(x)):
mask_i = x[i].mask
if w[i] is not None:
mask_i = mask_i | w[i].mask
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would raise an error if the weight matrix contained mask elements that weren't masked in the data x itself. Unless I am not seeing a use case for that?

Copy link
Contributor Author

@TrishGillett TrishGillett Sep 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question of whether to mask invalid weights was raised on #6483 but not settled. I don't know of a use case offhand and am okay with throwing an error instead if the feeling is that this is overbuilding. @efiring Do you think there could be a use case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using histograms with weights is reasonable way to do spatial integration of images, ex histogram(convert_pixel_locations_to_1D, bins, weights=image.ravel()). If you are looping over a bunch of images, some of which have artifacts that you need to mask out, being able to do this loop with out having to re-mask the pixel position values would be nice.

That said, this is something that should be pushed down into numpy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that explains the mask on x, not on the weights. I am also fine with a mask on the weights as long as it is identical as the one on x. But masks different on the weights and on x seem like it may be an error on the users side, hence throwing a ValueError.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the other way around

xx = convent_to_1D(imgs.shape[:-1])
for im in imgs:
    m = compute_mask(im)
    w = ma.masked_array(im, m)
    ax.hist(m, bins, weight=w)

Basically, if we are going to allow masking on either, masking on both and taking the union seems like the best course.

Copy link
Member

@efiring efiring Sep 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree; combining independent masks on both via a union is not difficult, it is easy to explain, and it handles every situation that might arise.
See cbook.delete_masked_points; you could use this as soon as you have the list of sequences of values and the matching list of sequences of weights or of Nones. You would apply it to each matching sequence of values and its corresponding sequence of weights, or None. Maybe it's overkill, but it was designed for this sort of thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiring Thanks for the cbook tip again, between that and a realization that masks seem to stack, I believe I can rewrite this piece as:

if not input_empty:
    for i in range(len(w)):
        if w[i] is not None:
            x[i] = np.ma.masked_array(x[i], mask=w[i].mask)
            w[i] = np.ma.masked_array(w[i], mask=x[i].mask)
    x = cbook.delete_masked_points(x)    
    w = cbook.delete_masked_points(w)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite what I had in mind...

xclean, wclean = [], []
for xi, wi in zip(x, w):
    xiclean, wiclean = cbook.delete_masked_points(xi, wi)
    xclean.append(xiclean)
    wclean.append(wiclean)

This works regardless of whether wi is None or a sequence the length of xi. Using cbook.delete_masked_points is only worthwhile if you are giving it more than one argument. Deleting the masked points from a single 1-D masked array can be done simply by calling it's compressed() method, so that's all you need in your version above. You would call w[i] = w[i].compressed() inside the inner if, and x[i] = x[i].compressed() immediately after it. There are many variations on the theme; I'm not sure which is best.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, that's pretty nice, I didn't read the docs carefully enough to see the part about any points masked in one arg getting masked and deleted in the others. I'm not sure I understand what 'All input arguments that are not passed unchanged' means though...

if inp.ndim == 2:
# 2-D input with columns as datasets; switch to rows
inp = inp.T

if inp.shape[1] < inp.shape[0]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am 👎 on this warning, I think it is assuming too much about how users will be using this method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now see that this is just moving this warning around. I would be in favor of removing it, but if it is already in the code base, mostly ignore my previous comment.

Copy link
Contributor Author

@TrishGillett TrishGillett Sep 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually just migrated up from where it was before (old line 6068). I don't care for it either but left it alone thinking it probably got debated when it was added. I'm more than happy to remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. I'm good either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absent a compelling reason I err on the side of not removing things (which may not be a fully defend able position, but it is a safe conservative one)

Chasing this back through git blame, the warning came into the code base in
214976f in 2008.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also in favor of removing it (I am biased as I am in a field where nsamples << nvariables )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, wow, that I can see being a problem.
Maybe throw an error (or raise a warning) if x.shape[0] == 1?

@tacaswell tacaswell added this to the 2.1 (next point release) milestone Sep 19, 2016
@NelleV
Copy link
Member

NelleV commented Sep 30, 2016

Hi @TrishGillett
Can remove the WIP in the title once this is ready to be reviewed and merged?

@TrishGillett
Copy link
Contributor Author

TrishGillett commented Sep 30, 2016

Sorry this has been on pause, the checklist here is still valid and ongoing. I just started a new job and haven't had time to make a VM with the right setup for image comparison tests, and I also haven't decided on a good minimal set of tests for this change (help welcome).

@NelleV
Copy link
Member

NelleV commented Sep 30, 2016

I took the liberty to move the check list to description of the comment. This way, github interprets it properly.

@tacaswell
Copy link
Member

@TrishGillett See http://matplotlib.org/devdocs/devel/testing.html#building-matplotlib-for-image-comparison-tests with the right env mpl will build it's own copy of the right version of freetype.

@tritemio
Copy link

tritemio commented Jun 6, 2017

Any plan for finalizing this PR? I think it is a pretty important fix. With the current bug in numpy.histogram using plt.hist is becoming quite cumbersome when input has NaNs.

@TrishGillett
Copy link
Contributor Author

Hey @tritemio, thanks for the report. I'll see if I can revive this.

@tacaswell tacaswell modified the milestones: 2.1 (next point release), 2.2 (next next feature release) Aug 29, 2017
@dstansby
Copy link
Member

dstansby commented Nov 1, 2017

I've been stumbling upon the issue fixed by this PR a lot recently! @TrishGillett do you mind if I pick this up and try and finish the pull request?

@TrishGillett
Copy link
Contributor Author

TrishGillett commented Nov 1, 2017

@dstansby You'd be more than welcome! I tried to pick this PR back up recently but found that it will need some nontrivial work to be brought up to date, made more difficult for me by my loss of context on this code over time. 😦

@klapo
Copy link

klapo commented Nov 6, 2017

@TrishGillett @dstansby I took a look through the test errors and it seems possible that this PR failed due to a flaw in the test itself that was cleaned up a couple weeks after you submitted the PR.

The test that failed in this PR was disabled due to a missing fonts issue. Below is the image diff from the test. Very clearly a font issue that should be unrelated to masking nans when plotting histograms.

Is it possible to try re-testing?

pgf_pdflatex_pdf-failed-diff

@jklymak
Copy link
Member

jklymak commented Nov 6, 2017

You'll need to follow the instructions at http://matplotlib.org/devel/gitwash/development_workflow.html Under "rewriting commit history" to 000000. Make a backup branch (Definitely do this!) 1. squash your commits, 2. rebase to master, 3. Force push back to origin...

@dstansby
Copy link
Member

dstansby commented Nov 6, 2017

In order to not mess up your branches @TrishGillett , I've made a new PR with exactly the same code here: #9706. Lets see what happens with CI!

@tacaswell tacaswell modified the milestones: needs sorting, v3.2.0 Mar 31, 2019
@tacaswell
Copy link
Member

Found this while seeing what #8638 was related to.

This should be revived, but I am not sure where things landed in light of #9706 being closed.

@dstansby dstansby modified the milestones: v3.2.0, v3.3.0 Aug 25, 2019
@QuLogic QuLogic modified the milestones: v3.3.0, v3.4.0 May 27, 2020
@jklymak jklymak added status: orphaned PR Good first issue Open a pull request against these issues if there are no active ones! labels Jul 14, 2020
@jklymak
Copy link
Member

jklymak commented Jul 14, 2020

This seemed mostly there. If anyone takes it up, please read the history carefully to make sure all opinions are taken into account.

@jklymak jklymak marked this pull request as draft July 23, 2020 16:46
@lucyleeow
Copy link
Contributor

Is there still interest in continuing this PR? From #9706 (review) it seems like there was discussion about how to deal with missing values but no link so not sure how that discussion evolved?

@jklymak
Copy link
Member

jklymak commented Oct 26, 2020

I don't remember the details, but you are welcome to take another try at it if you have a reasonable idea how to proceed. Note that some changes wrt NaN handling have already gone in....

@lucyleeow
Copy link
Contributor

Thanks @jklymak, looking at it more, I don't think I am familiar enough with the code base to tackle this.

@QuLogic QuLogic modified the milestones: v3.4.0, unassigned Jan 21, 2021
@jklymak
Copy link
Member

jklymak commented Jun 14, 2021

I'll closed as abandoned; thanks for the PR, and feel free to re-open if you want to come back to this!

@jklymak jklymak closed this Jun 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Good first issue Open a pull request against these issues if there are no active ones! status: needs rebase status: orphaned PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.