Skip to content

[ENH]: Adapt 2-dim histogram axis limits to content #26288

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Hvass-Labs opened this issue Jul 11, 2023 · 9 comments
Closed

[ENH]: Adapt 2-dim histogram axis limits to content #26288

Hvass-Labs opened this issue Jul 11, 2023 · 9 comments

Comments

@Hvass-Labs
Copy link

Problem

Plotting a 2-dim histogram using the hist2d function can leave a lot of excess space in the plot when using the argument cmin to exclude bins with insufficient numbers of data-points. (There's presumably a similar problem when using the argument cmax).

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Generate random x/y values.
rng = np.random.default_rng(1234)
size = 100000
x = rng.normal(scale=10, size=size)
y = rng.normal(size=size)

# This plot has properly adjusted axis-limits.
plt.hist2d(x, y, bins=100, cmin=1);

image

# This plot has a lot of excess space.
plt.hist2d(x, y, bins=100, cmin=20);

image

Proposed solution

The following is a quick solution that seems to work. But it may be possible to make a better solution when integrating it into the hist2d function. It should be made as an optional argument such as hist2d(adapt_lim=True, ...) to adapt both x and y-axis, or specify a given axis such as hist2d(adapt_lim='x', ...)

I don't really have time to add this to Matplotlib's code-base myself, and make sure everything is made according to your standards etc. So hopefully someone else can do it, if you like the feature. Thanks!

# Plot 2-dim histogram and get bins and edges.
h, xedges, yedges, _ = plt.hist2d(x, y, bins=100, cmin=20);

# Boolean mask whether a bin is actually used in the 2-dim grid.
mask = ~np.isnan(h)

# Flattened boolean masks whether a bin is actually used for each axis.
mask_flat_x = np.any(mask, axis=1)
mask_flat_y = np.any(mask, axis=0)

# Get x-axis min/max for bins that are actually used.
x_min = xedges[:-1][mask_flat_x].min()
x_max = xedges[1:][mask_flat_x].max()

# Get y-axis min/max for bins that are actually used.
y_min = yedges[:-1][mask_flat_y].min()
y_max = yedges[1:][mask_flat_y].max()

# Adjust x-axis limits to have a little padding.
x_pad = 0.01 * (x_max - x_min)
plt.xlim(x_min - x_pad, x_max + x_pad)

# Adjust y-axis limits to have a little padding.
y_pad = 0.01 * (y_max - y_min)
plt.ylim(y_min - y_pad, y_max + y_pad)

plt.show();

image

@jklymak
Copy link
Member

jklymak commented Jul 11, 2023

cmin sets undervalues to NaN in hist2d? It should set to the colormaps' "under" color. that is a pretty outrageous inconsistency.

@Hvass-Labs
Copy link
Author

I actually need the plot to be transparent when bin-counts are too low, so hopefully that will still be possible if you change it to colormaps' "under" color.

@tacaswell
Copy link
Member

I very much disagree that the second example is "wrong". From a library point of view we should show the user the data that was passed to us, not the data the user passed us that we might want. Put another way,

I think wrapping that code up in a helper like

def shrink_wrap_hist_and_hide_existinace_of_outliers(h, xedges, yedges, pc):
    ...

or just wrapping the call to hist2d to do any pre/post processing that make sense.

is the best option and I am very 👎🏻 on adding a flag to hist2d and think this should be closed with no-action.


It looks like setting the under values to NaN goes back to when hist2d came into the library in #826 .

From a strictly visual view:

diff --git a/lib/matplotlib/axes/_axes.py b/lib/matplotlib/axes/_axes.py
index 4eff4ec1b5..3a3dae0711 100644
--- a/lib/matplotlib/axes/_axes.py
+++ b/lib/matplotlib/axes/_axes.py
@@ -7190,15 +7190,11 @@ such objects
         h, xedges, yedges = np.histogram2d(x, y, bins=bins, range=range,
                                            density=density, weights=weights)

-        if cmin is not None:
-            h[h < cmin] = None
-        if cmax is not None:
-            h[h > cmax] = None
-
-        pc = self.pcolormesh(xedges, yedges, h.T, **kwargs)
+        pc = self.pcolormesh(xedges, yedges, h.T, vmin=cmin, **kwargs)
         self.set_xlim(xedges[0], xedges[-1])
         self.set_ylim(yedges[0], yedges[-1])
-
+        if cmin is not None and 'cmap' not in kwargs:
+            pc.cmap.set_under((0, 0, 0, 0))
         return h, xedges, yedges, pc

     @_preprocess_data(replace_names=["x", "weights"], label_namer="x")

is enough for many cases (the edge case of the user passed in cmap is a bit dicy...we have to decide when to "fix" the under color but that seems tractable), but the changed return value is a hard one to warn on / manage the deprecation of.

Setting the "under" values to nan is the correct conceptual behavior as we are doing the binning in some cases. Saying "I want to do the binning and any bin that has under some count should be treated as missing data" is a reasonable thing to want to do and it is pretty clearly documented: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.hist2d.html#matplotlib.axes.Axes.hist2d

As near as I can tell we do not actually use cmin / cmax anywhere else in the library (the method on ScalarMappable is set_clim but the parametrs are named vmin and vmax 🤷🏻 ) so I do not think there is a terrible inconsistency.

So, on net, although I am with @jklymak in being surprised by the behavior of treating out-of-bounds data as missing data, it is a reasonable, documented, and long standing behavior so it is not a candidate for deprecation or change.

@tacaswell tacaswell closed this as not planned Won't fix, can't repro, duplicate, stale Jul 27, 2023
@Hvass-Labs
Copy link
Author

@tacaswell Thanks very much for your extensive comment!

I think our disagreement boils down to what you said here:

"we should show the user the data that was passed to us, not the data the user passed us that we might want"

But we are using cmin to remove outliers from the data, and it is actually very useful because it can remove "2-dim outliers" that are otherwise very difficult / impossible to remove. Let me give an example.

First we have cmin=0 where the entire plot is blue so we cannot see the background:

cmin=0

Second we have cmin=3 which reveals the background and also shows the sparsely populated areas better, but there is a lot of "empty space" on top:

cmin=3

Finally we have cmin=3 and the adjustment of the axis-limit using the algorithm above, which has removed the empty space on top (and added small margins to both axis):

cmin=3 and adjust axis limits

This is a very typical example of what I experience with hist2d in my application, but for more irregular data the empty space can be much more extreme, e.g. 2/3 of the y-axis is empty space. So I really do need to remove it.

Because we are using cmin to remove outliers so they are no longer shown, I think it is also reasonable to have an argument in the function hist2d to remove the empty space that results. But I don't think it should always remove the empty space, because some users may not want to remove the empty space for whatever reason. So an option would be good.

(Apart from this issue, I think Matplotlib has made some beautiful plots here!)

@tacaswell
Copy link
Member

Part of the core problem here is that hist2d is doing both data reduction and plotting so the parameters have confusing and coupled semantics.

One other aspect of the missing data problem is if you treat it as "there should be data here but we found nothing" vs "there is nothing to see here, not even the possibility that there could have been data". In the first case it is thing like in a map showing the outline of areas where there is missing data (because it is interesting information that state X has no data) vs auto scaling to the data given (because while there is other possible data .... we don't have it).

I think that given this ambiguity here I the default should be "it is interesting that this data is missing". I do not have a good sense of how common the second case is for hist2d (is this a widespread need and you are the first person to speak up or are you the only person in the world who wants this ;) ).

If we were to add a way to tell hist2d to implement the "nothing to see here" version we should do it by trimming the inputs to pcolormesh and letting the normal margin / autolimit logic do its thing.

@Hvass-Labs
Copy link
Author

There's not many people referring to hist2d and the cmin argument in GitHub code. This search shows the code on GitHub having hist2d and cmin=1 and there are 390 results, compared to 28k results when omitting cmin=1. I looked at a few results but so many people have the bad habit of not commenting their code, so I don't know why they're using cmin.

I use cmin=1 to reveal the background of the plot when there is no data in a histogram bin. And I use e.g. cmin=3 to remove 2-dim outliers that I cannot remove any other way. I imagine others use cmin for the same reasons. But this sometimes creates a lot of free space in the plot that looks weird, sometimes occupying 2/3 of the y-axis in my plots.

My plots are generated from simulated data. The data doesn't come from a data-file. So the user doesn't know what data "should" be there. The user simply specifies whether they want to remove outliers or not. I also use another traditional method of removing extreme outliers that are too far from the center of the distribution, but that cannot remove "2-dim" outliers, which hist2d does very easily using cmin.

So I really need the cmin argument in hist2d, and I really need to remove the empty space that results from removing the outliers.

The docs describe what cmin does, but doesn't specify the intended purpose:

cmin, cmax : float, default: None
All bins that has count less than cmin or more than cmax will not be displayed (set to NaN before passing to imshow) and these count values in the return value count histogram will also be set to nan upon return.

It's a fairly isolated feature to add an adapt_lim argument to hist2d, that won't break anything else in Matplotlib, and I think I have given very good arguments that it belongs there and is useful. If the default argument is None then it defaults to the current behavior.

Allow me to say, that I think we are both wasting too much time on this. Especially as someone has already taken my code above and turned it into a PR. I don't mean to be rude as you guys are doing a great job with Matplotlib, but it does feel rather difficult to get something accepted into the code-base, and that is unfortunate in an open-source project.

Whether this feature should be implemented as I've done above, or in some of the other ways you suggest, I can't answer. But unless you or someone else wants to implement it that way, then I hope you would just move ahead with the working solution we have here.

The person who opened the PR is apparently a young student who wants to contribute to open-source, so it might be a bit demoralizing for them to get shut down like that. I will give the person a few comments on the PR, and it might be a good investment for you guys to give the person a few helpful comments as well, so they might come back and contribute more in the future.

If we still disagree on whether this feature belongs in Matplotlib, what is the procedure for "conflict resolution" here? Do you vote? Or are the core developers the "ruling junta" so they just get to decide? :-)

@jklymak
Copy link
Member

jklymak commented Aug 2, 2023

but it does feel rather difficult to get something accepted into the code-base, and that is unfortunate in an open-source project.

This is a huge codebase, and we tend to be very conservative adding new behaviour. A new change needs to argue that it cannot be easily done outside the library. In this case, it's just a case of using numpy.histogram2d and trimming the data before sending to pcolormesh.

It's a fairly isolated feature to add an adapt_lim argument to hist2d,

It sounds like an isolated feature, but if it only goes into hist2d it would be an API inconsistency. Any Artist we have that does color mapping can have invisible elements (contour, pcolormesh, scatter, etc). We do not currently change the autolimit behaviour for any of these, and doing so consistently would be a big job.

Or are the core developers the "ruling junta" so they just get to decide?

Thats one way to express it. The more charitable way to say it is that the core developers have many years of experience put into the project and have to make judgement calls, particularly as they will likely assume the maintenance burden of any new features. OTOH, the ruling junta is relatively permissive with inviting new members who show interest with PRs and contributions.

@rcomer
Copy link
Member

rcomer commented Aug 2, 2023

A new change needs to argue that it cannot be easily done outside the library. In this case, it's just a case of using numpy.histogram2d and trimming the data before sending to pcolormesh.

The solution in the OP also uses only public API. Though I think it might benefit from the hint in #25962.

@tacaswell
Copy link
Member

https://snarky.ca/setting-expectations-for-open-source-participation/ (longer) and https://textual.textualize.io/blog/2023/07/29/pull-requests-are-cake-or-puppies/ (shorter) are good posts about the sort of reasoning that maintainers go through when evaluating PRs and feature requests.

Our rules for merging are documented here: https://matplotlib.org/devdocs/devel/coding_guide.html#merging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants