
Make hexbin mincnt consistently inclusive of lower bound #21381


Closed
wants to merge 3 commits

Conversation

@RebeccaWPerry (Contributor) commented Oct 19, 2021

PR Summary

This addresses #20877. I dug into the root cause and found that "mincnt" is documented as being exclusive, sometimes behaves as inclusive (default when you don't provide C), and sometimes behaves as exclusive (when you do provide C). In this PR, I make three small changes to standardize on mincnt being inclusive.

Prior to the fix:

  • If the user does not provide input C, then mincnt is inclusive; mincnt=1 displays hexagons with 1 or more points. I believe this is the expected behavior.
  • If the user does provide input C, then mincnt is exclusive; mincnt=1 displays hexagons with more than 1 point.

After the fix:

  • mincnt is treated as inclusive whether or not the user supplies C, and the docstring is updated to reflect this.

Expected behavior is for both of these plots to be identical (adapted from the example in #20877). The number of points reported by the print statements should be identical as well:

import numpy as np
import matplotlib.pyplot as plt

npts = int(2e4)

# Uniformly distributed y values and normally distributed x values
y = np.random.randint(0, 100, npts)
x = np.random.normal(loc=0, scale=2, size=npts)

# Plot
fig, axes = plt.subplots(2, 1)

# C not specified
ax = axes[0]
res1 = ax.hexbin(x, y, mincnt=1, extent=[-8, 8, 0, 100], gridsize=150)

# Adding weights of 1 and specifying reduce as sum should be identical to no weights
ax = axes[1]
res2 = ax.hexbin(x, y, C=np.ones(len(x)), mincnt=1, extent=[-8, 8, 0, 100],
                 gridsize=150, reduce_C_function=np.sum)

c1 = res1.get_array()                
c2 = res2.get_array()

print('points binned in plot1: ', sum(c1))
print('points binned in plot2: ', sum(c2))

fig.savefig(f"hexbin_weights.png")
plt.show()

PR Checklist

  • Has pytest style unit tests (and pytest passes).
  • Is flake8 compliant (run flake8 on changed files to check).
  • [N/A] New features are documented, with examples if plot related.
  • Documentation is sphinx and numpydoc compliant (the docs should build without error).
  • Conforms to Matplotlib style conventions (install flake8-docstrings and run flake8 --docstring-convention=all).
  • [N/A] New features have an entry in doc/users/next_whats_new/ (follow instructions in README.rst there).
  • [N/A] API changes documented in doc/api/next_api_changes/ (follow instructions in README.rst there).

@jklymak (Member) commented Oct 19, 2021

Hi @RebeccaWPerry, maybe I'm mistaken, but I think the problem in #20877 is that if C is provided, each element in a hexagon can count for less than 1. This has nothing to do with whether the comparison is inclusive or not.

@RebeccaWPerry (Contributor, Author) commented Oct 19, 2021

@jklymak it was super interesting to look into! Check out the example I added -- where C is all ones. Surely if I'm "scaling" by 1 there should be no difference?

If C is None: hexagons with fewer than mincnt points are set to nan (code)

If C is provided: hexagons with more than mincnt points are set to the reduced value (code)
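
To make the difference concrete, here is a simplified paraphrase of the two branches (illustrative pseudocode only, not the actual matplotlib source; counts stands for the number of points per hexagon):

if C is None:
    keep = counts >= mincnt   # inclusive: a bin with exactly mincnt points is kept
else:
    keep = counts > mincnt    # exclusive: a bin with exactly mincnt points is dropped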

And I found no evidence that the scaling of C should impact the count -- count is simply count as implemented and as the bug reporter expected.

@QuLogic (Member) commented Oct 19, 2021

I don't know about the scaling, but the condition on mincnt is inconsistent, and scaling doesn't make a difference there.

This would need a test though.

@jklymak (Member) commented Oct 20, 2021

And I found no evidence that the scaling of C should impact the count -- count is simply count as implemented and as the bug reporter expected.

Agreed, I originally misdiagnosed the problem.

This change is fine, but as @QuLogic suggests, a test would be nice. You can probably just add np.testing.assert_allclose(c1, c2) at the end of your code above and put it where hexbin is tested (probably just in test_axes.py).
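
A minimal sketch of such a test (the test name, seed, and gridsize are illustrative, and it assumes the inclusive-mincnt fix from this PR):

import numpy as np
import matplotlib.pyplot as plt


def test_hexbin_mincnt_with_C_all_ones():
    # With C set to all ones and reduce_C_function=np.sum, the result should
    # match the unweighted call bin for bin once mincnt is inclusive in both paths.
    rng = np.random.default_rng(19680801)
    x = rng.normal(size=1000)
    y = rng.normal(size=1000)
    fig, (ax1, ax2) = plt.subplots(2, 1)
    h1 = ax1.hexbin(x, y, mincnt=1, gridsize=20)
    h2 = ax2.hexbin(x, y, C=np.ones_like(x), mincnt=1, gridsize=20,
                    reduce_C_function=np.sum)
    np.testing.assert_allclose(h1.get_array(), h2.get_array())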

@jklymak added this to the v3.6.0 milestone Oct 20, 2021
@RebeccaWPerry (Contributor, Author)
Sounds good - I'll add a test!

@tacaswell (Member)
This PR is affected by a re-writing of our history to remove a large number of accidentally committed files; see discourse for details.

To recover this PR it will need to be rebased onto the new default branch (main). There are several ways to accomplish this, but we recommend the following (assuming that you call the matplotlib/matplotlib remote "upstream"):

git remote update
git checkout main
git merge --ff-only upstream/main
git checkout YOUR_BRANCH
git rebase --onto=main upstream/old_master
# git rebase -i main # if you prefer
git push --force-with-lease   # assuming you are tracking your branch

If you do not feel comfortable doing this or need any help please reach out to any of the Matplotlib developers. We can either help you with the process or do it for you.

Thank you for your contributions to Matplotlib and sorry for the inconvenience.

@RebeccaWPerry (Contributor, Author)
I think I have addressed the feedback -- added a test and rebased to main. Please let me know if there is anything else to address!

@jklymak previously approved these changes Oct 25, 2021
@anntzer (Contributor) commented Oct 25, 2021

While I agree with the change (which probably needs an API change note), there is an additional issue to consider (#16865 (comment)): with this PR, hexbin(np.random.randn(10000), np.random.randn(10000), C=np.random.rand(10000)) (for example) now emits "RuntimeWarning: Mean of empty slice." and "RuntimeWarning: invalid value encountered in double_scalars", because the default reduce_C_function (np.mean) is getting called on empty bins as well. At least I think the default reduce_C_function should not emit the warning (it should be a small wrapper around np.mean that directly returns nan with no warning for empty inputs). Perhaps the doc for reduce_C_function should also mention that the reducer needs to be able to handle empty inputs.
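
A minimal sketch of what that wrapper could look like (hypothetical name _mean_or_nan; not an existing matplotlib helper):

import numpy as np


def _mean_or_nan(values):
    # Behave like np.mean, but return nan for empty input without emitting
    # "Mean of empty slice" / "invalid value" runtime warnings.
    values = np.asarray(values)
    return values.mean() if values.size else np.nan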

@anntzer (Contributor) left a comment


Per #21381 (comment); anyone can dismiss once handled.

@timhoffm (Member)
A quick thought (I've not followed the situation in detail, so might be wrong):

Perhaps the doc for reduce_C_function should also mention that the reducer needs to be able to handle empty inputs.

Or should we wrap every reduce_C_function that is passed? That would take the burden off the user and allow passing many existing functions as-is. The only way this could fall short is if the answer to empty inputs should not be NaN.
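
A rough sketch of that auto-wrapping idea (hypothetical helper name, assuming NaN is the right result for empty bins):

import functools

import numpy as np


def _nan_on_empty(reduce_C_function):
    # Wrap any user-supplied reducer so that empty bins yield nan instead of
    # warning or erroring out.
    @functools.wraps(reduce_C_function)
    def wrapper(values):
        return reduce_C_function(values) if len(values) else np.nan
    return wrapper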

@anntzer (Contributor) commented Oct 25, 2021

But I think(?) the problem also comes from the fact that the default reducer is actually a bit weird; e.g. I would rather have expected it to be sum() rather than mean(). For example

hexbin(np.random.randn(10000), np.random.randn(10000), C=np.random.rand(10000), reduce_C_function=np.sum)

likely makes more sense (I know, that's pretty vague...) than

hexbin(np.random.randn(10000), np.random.randn(10000), C=np.random.rand(10000))

and sum() is just fine with empty inputs (and I'd rather have sum() return 0 rather than nan).

(Not that the current situation is really better, I guess... but my point is that auto-wrapping is not a complete solution either.)
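
For reference, numpy's behavior on empty input, which is what the trade-off above comes down to:

import numpy as np

np.sum([])   # 0.0, no warning
np.mean([])  # nan, plus "RuntimeWarning: Mean of empty slice"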

@jklymak (Member) commented Oct 26, 2021

Yes, I agree, this probably has to be fixed before this can go in.

@jklymak self-requested a review October 26, 2021 07:08
@jklymak dismissed their stale review October 26, 2021 07:09

fundamental issue with accumulation for 0-sized bins

@anntzer (Contributor) commented Oct 26, 2021

re-posting my message on #16865 (let's keep the discussion here instead)

Briefly looking at this again, I think that

  • when C is unset (plain counts), it makes sense to default to <0 (i.e., keep everything): this ensures that empty bins are reported as zero and not nan, which can be helpful e.g. for interactive cursoring.
  • when C is set, OTOH, it makes sense to skip empty bins and set them to nan, as one cannot compute means on them anyways.

So mincnt has two slightly different semantics in the two cases. Perhaps yet another solution would be to note that mincnt is actually not so useful when C is unset: in that case, the empty bins can just be reported as zero, and if you want to mask them out you can do so using standard colormapping tools (e.g., vmin=1 and cmap=mpl.cm.get_cmap().with_extremes(under="none")). So perhaps one should just deprecate the use of mincnt with C unset? I'm not really sure what the history behind mincnt is... (unfortunately the git log is not helpful here)
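
For instance, a sketch of that colormapping approach (parameter values here are just illustrative):

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

x, y = np.random.randn(2, 10000)
fig, ax = plt.subplots()
# Keep every bin (zero counts included) and hide the empty ones purely via the
# colormap: counts below vmin=1 map to the "under" color, which is transparent.
ax.hexbin(x, y, gridsize=50, vmin=1,
          cmap=mpl.cm.get_cmap("viridis").with_extremes(under="none"))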

@RebeccaWPerry (Contributor, Author) commented Oct 30, 2021

I'm parsing my way through the various options here - in response to this:

and sum() is just fine with empty inputs (and I'd rather have sum() return 0 rather than nan).

I think it's important for sum() to return nan when there are no points in a cell, because having no points is different from having points that sum to 0 (either all points are 0, or there are negative and positive values that balance each other out).

@RebeccaWPerry (Contributor, Author)
@anntzer thanks for flagging this interconnected issue. Please bear with me since this is my first PR that has gotten this far.
Here are the use cases under consideration and the current behavior:

Weights C NOT specified

  • mincnt not specified --> returns 0's for cells with 0 points
  • mincnt specified --> the output array has no values for cells
    with < mincnt points (the output array is actually smaller, not full of nan's)

Weights C ARE specified

  • mincnt not specified --> defaults to mincnt=0; mean throws a warning and only returns values
    where the aggregator function could be computed (the output array is smaller than the total number of cells, not full of nan's)
  • mincnt specified --> the array has no values for cells with < mincnt points, and still throws
    a warning for mean if mincnt=0

Here are the paths forward I can see in the thread above and from thinking about it some more:

  1. Change the default to mincnt=1. Change the docstring to mincnt >= 0 to show that a user can specify 0 to include cells with 0 counts (or for other reduction functions that seamlessly handle 0 points)

  2. Change the default aggregator function to sum() so that the default behavior does not throw warnings. If someone switches to mean() they would have to navigate the warning message / specify mincnt=1 to avoid warnings. Could add something to that effect in the docstring?

  3. In the code, check whether the reduction function handles 0 points gracefully or not and set mincnt to 0 or 1 accordingly (if not explicitly set). This seems like confusing magic under the hood, so I would probably not recommend it.

  4. Wrap np.mean so that it returns nan without a warning. Mention in the docs that the reducer needs to be able to handle 0 points.

Am I missing any viable options? Any recommendations on which path forward to choose? I like 1 or 2, but I'm not sure how to weigh the importance of maintaining the same default behavior of a function from one code release to the next.

Thanks in advance for the guidance! I'm happy to make the changes once I have a clear sense of the best direction to take this in!

@anntzer (Contributor) commented Oct 31, 2021

You ran into a fairly complicated issue to start with, and you are doing well in keeping all relevant points in your head :-)

I agree with your description of the current behavior. We also try very hard not to change behavior, especially if there's no deprecation warning in between (silently changing the plots of end users is not nice). Here, I think the agreement is that we want mincnt to behave consistently regardless of whether C is given, so there will need to be a backcompat break at some point, and therefore a deprecation warning. I also agree that the magic of (3) is not really a good option (we can't know what warnings we need to suppress, and what warnings are legitimate ones that need to be propagated out).

In my view, defaulting reduce_C_function to mean() doesn't make much sense; for comparison, hist2d defaults to summing the weights that fall into a bin rather than averaging them. Given that we're already going to have a deprecation period for the change in mincnt semantics, we may just as well do both changes at once and switch both to sum() and to an inclusive mincnt. (This is something where I'd like other devs to weigh in, in particular @dstansby, who has recently played with hexbin().) Therefore, I guess the deprecation would be something like (unchecked...)

_unset = object()

def hexbin(..., mincnt=_unset, reduce_C_function=_unset):
    ...
    if C is None:
        ...  # old code is fine; normalize _unset to zero
    if C is not None:
        if reduce_C_function is _unset:
            # normalize to mean-wrapped-with-zero-handling, but warn that this
            # will use sum() in the future instead.
            ...
        # do as previously
        # if mincnt is set and any bin has exactly mincnt points, warn that the
        # semantics will change, but note that one can set mincnt=-1 to "easily"
        # suppress the warning

and in the future (after the deprecation period), even though the signature will now be

def hexbin(..., mincnt=0, reduce_C_function=np.sum):

we would still add a special case for reduce_C_function is np.mean to wrap it with warning suppression.

@jklymak (Member) commented Nov 22, 2021

I don't think it's as complicated as all that. Currently, if C is set and a bin is empty, the bin is set to NaN. The suggested change would simply be:

if len(vals) >= mincnt and len(vals) > 0:
    lattice1[i, j] = reduce_C_function(vals)
else:
    lattice1[i, j] = np.NaN

That doesn't require a deprecation, and is consistent with the docs. It does require an API note, since it is fixing a bug?

@jklymak marked this pull request as draft February 1, 2022 12:03
@jklymak (Member) commented Feb 1, 2022

Popping to draft until the review comments can be dealt with. Feel free to mark active when ready...

@rcomer (Member) commented Dec 12, 2023

Hi @RebeccaWPerry, thank you for all your work digging into this. It looks like the issue you were targeting has recently been fixed by another pull request (using basically the same approach). The fix was #26113, with some follow-up in #27179 - a pretty gnarly problem!

So I'm going to close this one.
