-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Make hexbin mincnt consistently inclusive of lower bound #21381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @RebeccaWPerry, maybe I'm mistaken, but I think the problem in #20877 is that if |
@jklymak it was super interesting to look into! Check out the example I added -- where C is all ones. Surely if I'm "scaling" by 1 there should be no difference? if C is none: hexagons with less than mincount are set to nan (code) if C is provided: hexagons with more than mincount are set to the value (code) And I found no evidence that the scaling of C should impact the count -- count is simply count as implemented and as the bug reporter expected. |
I don't know about the scaling, but the condition on This would need a test though. |
Agreed, I originally misdiagnosed the problem. This change is fine, but as @QuLogic suggests a test would be nice. You can probably just |
Sounds good - I'll add a test! |
This PR is affected by a re-writing of our history to remove a large number of accidentally committed files see discourse for details. To recover this PR it will need be rebased onto the new default branch (main). There are several ways to accomplish this, but we recommend (assuming that you call the matplotlib/matplotlib remote git remote update
git checkout main
git merge --ff-only upstream/main
git checkout YOUR_BRANCH
git rebase --onto=main upstream/old_master
# git rebase -i main # if you prefer
git push --force-with-lease # assuming you are tracking your branch If you do not feel comfortable doing this or need any help please reach out to any of the Matplotlib developers. We can either help you with the process or do it for you. Thank you for your contributions to Matplotlib and sorry for the inconvenience. |
0ea4bd9
to
bd24a0a
Compare
I think I have addressed the feedback -- added a test and rebased to main. Please let me know if there is anything else to address! |
While I agree with the change (which probably needs an API change note), there is an additional issue to consider (#16865 (comment)): with this PR, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per #21381 (comment); anyone can dismiss once handled.
A quick thought (I've not followed the situation in detail, so might be wrong):
Or should we wrap every |
But I think(?) the problem also comes from the fact that the default reducer is actually a bit weird; e.g. I would rather have expected it to be sum() rather than mean(). For example
likely makes more sense (I know, that's pretty vague...) than
and sum() is just fine with empty inputs (and I'd rather have sum() return 0 rather than nan). (Not that the current situation is really better, I guess... but my point is that auto-wrapping is not a complete solution either.) |
Yes, I agree, this probably has to be fixed before this can go in. |
fundamental issue with accumulation for 0-sized bins
re-posting my message on #16865 (let's keep the discussion here instead)
|
I'm parsing my way through the various options here - in response to this:
I think it's important for sum() to return nan when where are no points in a cell because no points is different than having points that sum to 0 (either all points are 0, or there are negative and positive values that balance each other out). |
@anntzer thanks for flagging this interconnected issue. Please bear with me since this is my first PR that has gotten this far.
Weights C ARE specified
Here are the paths forward I can see in the thread above and from thinking about it some more:
Am I missing any viable options? Any recommendations on which path forward to choose? I like 1 or 2, but I'm not sure how to weigh the importance of maintaining the same default behavior of a function from one code release to the next. Thanks in advance for the guidance! I'm happy to make the changes if I can get a clear sense of what the best direction to take this in is! |
You ran into a fairly complicated issue to start with, and you are doing well in keeping all relevant points in your head :-) I agree with your description of the current behavior. We also try very hard not to change behavior, especially if there's no deprecation warning in between (silently changing the plots of end users is not nice). Here, I think the agreement is that we want mincnt to behave consistently regardless of whether C is given, so there will need to be a backcompat break at some point, and therefore a deprecation warning. I also agree that the magic of (3) is not really a good option (we can't know what warnings we need to suppress, and what warnings are legitimate ones that need to be propagated out). In my view, defaulting reduce_C_function to mean() doesn't make sense, just like hist2d defaults to summing weights that fall into a bin, rather than averaging them. Given that we're already going to have a deprecation period for the change in mincnt semantics, we may just as well do both of them at once, and both switch to sum() and to inclusive mincnt. (This is something where I'd like other devs to weigh in, in particular @dstansby, who has recently played with hexbin().) Therefore, I guess the deprecation would be something like (unchecked...)
and in the future (after the deprecation period), even though the signature will now be
still add a special-case for |
I don't think its as complicated as all that. Currently if C is set, and a bin is empty, the bin is set to NaN. The suggested change would simply be: if len(vals) >= mincnt and len(vals) > 0:
lattice1[i, j] = reduce_C_function(vals)
else:
lattice1[i, j] = np.NaN That doesn't require a deprecation, and is consistent with the docs. It does require an API note, since it is fixing a bug? |
Popping to draft until the review comments can be dealt with. Feel free to mark active when ready... |
Hi @RebeccaWPerry, thank you for all your work digging into this. It looks like the issue you were targeting has recently been fixed by another pull request (by basically the same approach). The fix was #26113, with some follow up in #27179 - a pretty gnarly problem! So I'm going to close this one. |
PR Summary
This addresses #20877. I dug into the root cause and found that "mincnt" is documented as being exclusive, sometimes behaves as inclusive (default when you don't provide C), and sometimes behaves as exclusive (when you do provide C). In this PR, I make three small changes to standardize on mincnt being inclusive.
Prior to the fix:
After the fix:
Expected behavior is for both of these plots to be identical (adapted from example in #20877). The number of points reported by the print statements should be identical as well.:
PR Checklist
pytest
passes).flake8
on changed files to check).flake8-docstrings
and runflake8 --docstring-convention=all
).doc/users/next_whats_new/
(follow instructions in README.rst there).doc/api/next_api_changes/
(follow instructions in README.rst there).