
Make hexbin mincnt consistently inclusive of lower bound #21381


Closed
wants to merge 3 commits

Conversation

@RebeccaWPerry (Contributor) commented Oct 19, 2021

PR Summary

This addresses #20877. I dug into the root cause and found that "mincnt" is documented as being exclusive, sometimes behaves as inclusive (default when you don't provide C), and sometimes behaves as exclusive (when you do provide C). In this PR, I make three small changes to standardize on mincnt being inclusive.

Prior to the fix:

  • If the user does not provide input C, then mincnt is inclusive; mincnt=1 displays hexagons with 1 or more points. I believe this is the expected behavior.
  • If the user does provide input C, then mincnt is exclusive; mincnt=1 displays hexagons with more than 1 point.

After the fix:

  • mincnt is treated as inclusive whether or not the user supplies C, and the docstring is updated to reflect this.

Expected behavior is for both of these plots to be identical (adapted from the example in #20877). The number of points reported by the print statements should be identical as well:

import numpy as np
import matplotlib.pyplot as plt

npts = int(2e4)

# Uniformly distributed y values and normally distributed x values
y = np.random.randint(0, 100, npts)
x = np.random.normal(loc=0, scale=2, size=npts)

# Plot
fig, axes = plt.subplots(2, 1)

# C not specified
ax = axes[0]
res1 = ax.hexbin(x, y, mincnt=1, extent=[-8, 8, 0, 100], gridsize=150)

# Adding weights of 1 and specifying reduce as sum should be identical to no weights
ax = axes[1]
res2 = ax.hexbin(x, y, C=np.ones(len(x)), mincnt=1, extent=[-8, 8, 0, 100],
                 gridsize=150, reduce_C_function=np.sum)

c1 = res1.get_array()                
c2 = res2.get_array()

print('points binned in plot1: ', sum(c1))
print('points binned in plot2: ', sum(c2))

fig.savefig(f"hexbin_weights.png")
plt.show()

PR Checklist

  • Has pytest style unit tests (and pytest passes).
  • Is flake8 compliant (run flake8 on changed files to check).
  • [N/A] New features are documented, with examples if plot related.
  • Documentation is sphinx and numpydoc compliant (the docs should build without error).
  • Conforms to Matplotlib style conventions (install flake8-docstrings and run flake8 --docstring-convention=all).
  • [N/A] New features have an entry in doc/users/next_whats_new/ (follow instructions in README.rst there).
  • [N/A] API changes documented in doc/api/next_api_changes/ (follow instructions in README.rst there).

@jklymak (Member) commented Oct 19, 2021

Hi @RebeccaWPerry, maybe I'm mistaken, but I think the problem in #20877 is that if C is provided, each element in a hexagon can count for less than 1. This has nothing to do with whether the comparison is inclusive or not.

@RebeccaWPerry (Contributor, Author) commented Oct 19, 2021

@jklymak it was super interesting to look into! Check out the example I added -- where C is all ones. Surely if I'm "scaling" by 1 there should be no difference?

If C is None: hexagons with fewer than mincnt points are set to nan (code)

If C is provided: hexagons with more than mincnt points are set to the reduced value (code)
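
To make the difference concrete, here is a simplified paraphrase of the two branches (illustrative pseudocode only, not the actual matplotlib source; counts stands for the number of points per hexagon):

if C is None:
    keep = counts >= mincnt   # inclusive: a bin with exactly mincnt points is kept
else:
    keep = counts > mincnt    # exclusive: a bin with exactly mincnt points is dropped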

And I found no evidence that the scaling of C should impact the count -- count is simply count as implemented and as the bug reporter expected.

@QuLogic (Member) commented Oct 19, 2021

I don't know about the scaling, but the condition on mincnt is inconsistent, and scaling doesn't make a difference there.

This would need a test though.

@jklymak (Member) commented Oct 20, 2021

And I found no evidence that the scaling of C should impact the count -- count is simply count as implemented and as the bug reporter expected.

Agreed, I originally misdiagnosed the problem.

This change is fine, but as @QuLogic suggests, a test would be nice. You can probably just add np.testing.assert_allclose(c1, c2) at the end of your code above and put it where hexbin is tested (probably just in test_axes.py).
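
A minimal sketch of such a test (the test name, seed, and gridsize are illustrative, and it assumes the inclusive-mincnt fix from this PR):

import numpy as np
import matplotlib.pyplot as plt


def test_hexbin_mincnt_with_C_all_ones():
    # With C set to all ones and reduce_C_function=np.sum, the result should
    # match the unweighted call bin for bin once mincnt is inclusive in both paths.
    rng = np.random.default_rng(19680801)
    x = rng.normal(size=1000)
    y = rng.normal(size=1000)
    fig, (ax1, ax2) = plt.subplots(2, 1)
    h1 = ax1.hexbin(x, y, mincnt=1, gridsize=20)
    h2 = ax2.hexbin(x, y, C=np.ones_like(x), mincnt=1, gridsize=20,
                    reduce_C_function=np.sum)
    np.testing.assert_allclose(h1.get_array(), h2.get_array())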

@jklymak added this to the v3.6.0 milestone Oct 20, 2021
@RebeccaWPerry (Contributor, Author)
Sounds good - I'll add a test!

@tacaswell (Member)
This PR is affected by a re-writing of our history to remove a large number of accidentally committed files; see discourse for details.

To recover this PR it will need to be rebased onto the new default branch (main). There are several ways to accomplish this, but we recommend the following (assuming that you call the matplotlib/matplotlib remote "upstream"):

git remote update
git checkout main
git merge --ff-only upstream/main
git checkout YOUR_BRANCH
git rebase --onto=main upstream/old_master
# git rebase -i main # if you prefer
git push --force-with-lease   # assuming you are tracking your branch

If you do not feel comfortable doing this or need any help please reach out to any of the Matplotlib developers. We can either help you with the process or do it for you.

Thank you for your contributions to Matplotlib and sorry for the inconvenience.

@RebeccaWPerry (Contributor, Author)
I think I have addressed the feedback -- added a test and rebased to main. Please let me know if there is anything else to address!

@jklymak previously approved these changes Oct 25, 2021
@anntzer (Contributor) commented Oct 25, 2021

While I agree with the change (which probably needs an API change note), there is an additional issue to consider (#16865 (comment)): with this PR, hexbin(np.random.randn(10000), np.random.randn(10000), C=np.random.rand(10000)) (for example) now emits "RuntimeWarning: Mean of empty slice." and "RuntimeWarning: invalid value encountered in double_scalars", because the default reduce_C_function (np.mean) is getting called on empty bins as well. At least I think the default reduce_C_function should not emit the warning (it should be a small wrapper around np.mean that directly returns nan with no warning for empty inputs). Perhaps the doc for reduce_C_function should also mention that the reducer needs to be able to handle empty inputs.
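
A minimal sketch of what that wrapper could look like (hypothetical name _mean_or_nan; not an existing matplotlib helper):

import numpy as np


def _mean_or_nan(values):
    # Behave like np.mean, but return nan for empty input without emitting
    # "Mean of empty slice" / "invalid value" runtime warnings.
    values = np.asarray(values)
    return values.mean() if values.size else np.nan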

@anntzer (Contributor) left a comment


Per #21381 (comment); anyone can dismiss once handled.

@timhoffm (Member)
A quick thought (I've not followed the situation in detail, so might be wrong):

Perhaps the doc for reduce_C_function should also mention that the reducer needs to be able to handle empty inputs.

Or should we wrap every reduce_C_function that is passed? That would take the burden off the user and allow passing many existing functions as-is. The only way this could fall short is if the answer to empty inputs should not be NaN.
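
A rough sketch of that auto-wrapping idea (hypothetical helper name, assuming NaN is the right result for empty bins):

import functools

import numpy as np


def _nan_on_empty(reduce_C_function):
    # Wrap any user-supplied reducer so that empty bins yield nan instead of
    # warning or erroring out.
    @functools.wraps(reduce_C_function)
    def wrapper(values):
        return reduce_C_function(values) if len(values) else np.nan
    return wrapper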

@anntzer (Contributor) commented Oct 25, 2021

But I think(?) the problem also comes from the fact that the default reducer is actually a bit weird; e.g. I would rather have expected it to be sum() rather than mean(). For example

hexbin(np.random.randn(10000), np.random.randn(10000), C=np.random.rand(10000), reduce_C_function=np.sum)

likely makes more sense (I know, that's pretty vague...) than

hexbin(np.random.randn(10000), np.random.randn(10000), C=np.random.rand(10000))

and sum() is just fine with empty inputs (and I'd rather have sum() return 0 rather than nan).

(Not that the current situation is really better, I guess... but my point is that auto-wrapping is not a complete solution either.)
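
For reference, numpy's behavior on empty input, which is what the trade-off above comes down to:

import numpy as np

np.sum([])   # 0.0, no warning
np.mean([])  # nan, plus "RuntimeWarning: Mean of empty slice"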

@jklymak (Member) commented Oct 26, 2021

Yes, I agree, this probably has to be fixed before this can go in.

@jklymak self-requested a review October 26, 2021 07:08
@jklymak dismissed their stale review October 26, 2021 07:09

fundamental issue with accumulation for 0-sized bins

@anntzer (Contributor) commented Oct 26, 2021

re-posting my message on #16865 (let's keep the discussion here instead)

Briefly looking at this again, I think that

  • when C is unset (plain counts), it makes sense to default to <0 (i.e., keep everything): this ensures that empty bins are reported as zero and not nan, which can be helpful e.g. for interactive cursoring.
  • when C is set, OTOH, it makes sense to skip empty bins and set them to nan, as one cannot compute means on them anyways.

So mincnt has two slightly different semantics in the two cases. Perhaps yet another solution would be to note that mincnt is actually not so useful when C is unset: in that case, the empty bins can just be reported as zero, and if you want to mask them out you can do so using standard colormapping tools (e.g., vmin=1 and cmap=mpl.cm.get_cmap().with_extremes(under="none")). So perhaps one should just deprecate the use of mincnt with C unset? I'm not really sure what the history behind mincnt is... (unfortunately the git log is not helpful here)
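
For instance, a sketch of that colormapping approach (parameter values here are just illustrative):

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

x, y = np.random.randn(2, 10000)
fig, ax = plt.subplots()
# Keep every bin (zero counts included) and hide the empty ones purely via the
# colormap: counts below vmin=1 map to the "under" color, which is transparent.
ax.hexbin(x, y, gridsize=50, vmin=1,
          cmap=mpl.cm.get_cmap("viridis").with_extremes(under="none"))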

@RebeccaWPerry (Contributor, Author) commented Oct 30, 2021

I'm parsing my way through the various options here - in response to this:

and sum() is just fine with empty inputs (and I'd rather have sum() return 0 rather than nan).

I think it's important for sum() to return nan when there are no points in a cell, because having no points is different from having points that sum to 0 (either all points are 0, or there are negative and positive values that balance each other out).

@RebeccaWPerry (Contributor, Author)
@anntzer thanks for flagging this interconnected issue. Please bear with me since this is my first PR that has gotten this far.
Here are the use cases under consideration and the current behavior:

Weights C NOT specified

  • mincnt not specified --> returns 0's for cells with 0 points
  • mincnt specified --> the output array has no values for cells
    with < mincnt points (the output array is actually smaller, not full of nan's)

Weights C ARE specified

  • mincnt not specified --> defaults to mincnt=0; mean throws a warning and only returns values
    where the aggregator function could be computed (the output array is smaller than the total number of cells, not full of nan's)
  • mincnt specified --> the array has no values for cells with < mincnt points, and still throws
    a warning for mean if mincnt=0

Here are the paths forward I can see in the thread above and from thinking about it some more:

  1. Change the default to mincnt=1. Change the docstring to mincnt >= 0 to show that a user can specify 0 to include cells with 0 counts (or for other reduction functions that seamlessly handle 0 points)

  2. Change the default aggregator function to sum() so that the default behavior does not throw warnings. If someone switches to mean() they would have to navigate the warning message / specify mincnt=1 to avoid warnings. Could add something to that effect in the docstring?

  3. In the code, check whether the reduction function handles 0 points gracefully or not and set mincnt to 0 or 1 accordingly (if not explicitly set). This seems like confusing magic under the hood, so I would probably not recommend it.

  4. Wrap np.mean so that it returns nan without a warning. Mention in the docs that the reducer needs to be able to handle 0 points.

Am I missing any viable options? Any recommendations on which path forward to choose? I like 1 or 2, but I'm not sure how to weigh the importance of maintaining the same default behavior of a function from one code release to the next.

Thanks in advance for the guidance! I'm happy to make the changes once I have a clear sense of the best direction to take this in!

@anntzer (Contributor) commented Oct 31, 2021

You ran into a fairly complicated issue to start with, and you are doing well in keeping all relevant points in your head :-)

I agree with your description of the current behavior. We also try very hard not to change behavior, especially if there's no deprecation warning in between (silently changing the plots of end users is not nice). Here, I think the agreement is that we want mincnt to behave consistently regardless of whether C is given, so there will need to be a backcompat break at some point, and therefore a deprecation warning. I also agree that the magic of (3) is not really a good option (we can't know what warnings we need to suppress, and what warnings are legitimate ones that need to be propagated out).

In my view, defaulting reduce_C_function to mean() doesn't make much sense; for comparison, hist2d defaults to summing the weights that fall into a bin rather than averaging them. Given that we're already going to have a deprecation period for the change in mincnt semantics, we may just as well do both changes at once and switch both to sum() and to an inclusive mincnt. (This is something where I'd like other devs to weigh in, in particular @dstansby, who has recently played with hexbin().) Therefore, I guess the deprecation would be something like (unchecked...)

_unset = object()

def hexbin(..., mincnt=_unset, reduce_C_function=_unset):
    ...
    if C is None:
        ...  # old code is fine; normalize _unset to zero
    if C is not None:
        if reduce_C_function is _unset:
            # normalize to mean-wrapped-with-zero-handling, but warn that this
            # will use sum() in the future instead.
            ...
        # do as previously
        # if mincnt is set and any bin has exactly mincnt points, warn that the
        # semantics will change, but note that one can set mincnt=-1 to "easily"
        # suppress the warning

and in the future (after the deprecation period), even though the signature will now be

def hexbin(..., mincnt=0, reduce_C_function=np.sum):

we would still add a special case for reduce_C_function is np.mean to wrap it with warning suppression.

@jklymak (Member) commented Nov 22, 2021

I don't think it's as complicated as all that. Currently, if C is set and a bin is empty, the bin is set to NaN. The suggested change would simply be:

if len(vals) >= mincnt and len(vals) > 0:
    lattice1[i, j] = reduce_C_function(vals)
else:
    lattice1[i, j] = np.NaN

That doesn't require a deprecation, and is consistent with the docs. It does require an API note, since it is fixing a bug?

@jklymak marked this pull request as draft February 1, 2022 12:03
@jklymak (Member) commented Feb 1, 2022

Popping to draft until the review comments can be dealt with. Feel free to mark active when ready...

@rcomer (Member) commented Dec 12, 2023

Hi @RebeccaWPerry, thank you for all your work digging into this. It looks like the issue you were targeting has recently been fixed by another pull request (using basically the same approach). The fix was #26113, with some follow-up in #27179 - a pretty gnarly problem!

So I'm going to close this one.
