Request: change hist bins default to 'auto' #16403
I'm not a huge fan of black boxes as a default - nbins=10 is pretty easy to explain. Even astropy, linked above, defaults to nbins=10. "Matplotlib will arbitrarily change the number of bins from data set to data set based on characteristics of that data set" is less easy to explain and can lead to just as many interpretation errors, if not more, in my opinion. OTOH, if we wanted to allow automatic algorithms that the user directly specifies, I'm fine with that so long as those algorithms are well documented. Preferably they would be compatible with whatever is in numpy.
That is already the case, right? You can specify either 'auto' or any of the other algorithms that numpy supports. I think there's a trade-off between "simple to explain" and "works well", and it's often not an obvious one. In this case, I think the black box works so much better that it would be worth it. FWIW, I have to explain to dozens of students why the default is a poor fit for their data.
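For context, these are the bin-selection strings numpy documents for `np.histogram_bin_edges`; `plt.hist` forwards its `bins` argument to numpy, so the same strings work there. A small sketch comparing what each rule chooses on the same sample (the sample itself is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

# Each rule numpy supports picks a different bin count for the same data;
# plt.hist(x, bins=<rule>) passes these strings through to numpy unchanged.
for rule in ["auto", "fd", "sturges", "scott", "rice", "sqrt", "doane"]:
    n_bins = len(np.histogram_bin_edges(x, bins=rule)) - 1
    print(f"{rule:>8}: {n_bins} bins")
```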
And re the astropy docs, please check the original issue. I'm pretty sure Jake (or someone else at astropy) wrote this documentation precisely because the default value is bad. They are trying to educate people to move away from bad default values. [edit: they could have changed their default but didn't, so they might not remove it if the default changes. This whole function might predate the implementation in numpy, and I'm not sure what it adds right now; maybe one new method?]
Sorry, I'm not sure what that was replying to.
You said people coming from R to Python are "shocked", so I was curious what R's algorithm is...
Ah. I think the default would be to use ggplot2, which seems to have... 30? https://ggplot2.tidyverse.org/reference/geom_histogram.html
Seaborn uses the Freedman-Diaconis rule by default, which I believe is the default in base R plots (not sure about ggplot). Edit: actually I guess R uses Sturges' rule by default, and numpy's 'auto' takes the maximum of the Sturges and Freedman-Diaconis estimates.
I am 👍 to this in principle.
The first thing to do would be to put
One thing to keep in mind: the automatic rules can produce arbitrarily small binwidths. This can pose two problems with large inputs.
For these reasons, seaborn actually uses the automatic rule but clips the resulting bin count.
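One way to read the clipping idea: take whatever bin count the 'auto' rule suggests, but cap it. A minimal sketch (the cap of 50 and the helper name are assumptions for illustration, not seaborn's actual implementation):

```python
import numpy as np

def clipped_auto_bins(data, max_bins=50):
    """Illustrative helper: use numpy's 'auto' rule, but cap the bin count.

    The cap of 50 and this helper's name are assumptions for the sketch,
    not taken from seaborn's code.
    """
    data = np.asarray(data)
    edges = np.histogram_bin_edges(data, bins="auto")
    return min(len(edges) - 1, max_bins)

# With heavy-tailed data, the 'auto' rule can suggest a huge number of
# bins (tiny widths over an enormous range); clipping keeps the plot and
# the draw time manageable.
rng = np.random.default_rng(0)
data = rng.standard_cauchy(100_000)
print(clipped_auto_bins(data))  # never exceeds 50
```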
I'm fairly strongly against changing the default to "auto". In fact I actually have hist.bins: auto in my own matplotlibrc.
It may seem hypocritical to argue against making "auto" the default even though that's what I personally use, but that's because I am well aware of the problems here, whereas the arguments above are in the direction of "we want to make auto the default because we don't want to require users to understand the underlying issues" -- but then we don't really want to start dealing with a bunch of bug reports claiming that "auto" gave nonsensical bins when it's really working "as intended" but "auto" is just a bad choice for the dataset at hand. I also think that the use cases in matplotlib are actually quite close to the ones in numpy (in the sense of: are there really many things that you would pass to np.histogram() that would not make sense passing to plt.hist()?), and thus, if anything, the change should go to numpy, not matplotlib. The deprecation strategy would be the same anyways...
I checked Matlab, and they now use an auto algorithm in their histogram function.
@anntzer While I agree it's not very principled.
This in fact seems very common in data science applications. Generally, it's probably hard to make a statement like this without defining a distribution over datasets ;) My problem with 10 is that it looks like it worked but it actually gave a nonsensical result, and so the user has no opportunity to discover their error.
The result may hide detail, but it's not nonsense, nor is it an "error". It's just a coarse way of looking at the data. If you analyze data and there is a choice like this, the first thing you should do is explore the consequences of making that choice on your data. Trusting an algorithm to make that choice, be it super-naive like n=10 or something fancy ("The binwidth is proportional to the interquartile range (IQR) and inversely proportional to cube root of a.size."), is not good data analysis practice. For that reason I prefer you start with the simple method and encourage trying the other methods.
Real example of a case where 10 is less misleading than "auto": I have a camera whose pixel values are bell-shaped distributed over a range of ~500, but quantized to integer values. The noise is not actually gaussian (a typical case would be an EMCCD camera at the limit of Poissonian photon counts), so I want to plot the distribution of pixel values (say 300x300 ~ 1e5 px) to examine it; here I'll generate normally distributed values for simplicity though.
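The code from that comment did not survive the thread, so here is a hedged reconstruction of the scenario (all parameters are illustrative, not the original snippet):

```python
import numpy as np

rng = np.random.default_rng(0)
# ~1e5 integer-quantized "pixel" values, bell-shaped over a few hundred
# counts, standing in for the camera data described above.
pixels = np.round(rng.normal(loc=1000, scale=80, size=90_000)).astype(int)

edges = np.histogram_bin_edges(pixels, bins="auto")
width = edges[1] - edges[0]
print(f"'auto' picked {len(edges) - 1} bins of width {width:.2f} "
      f"over a data range of {np.ptp(pixels)}")
# When the chosen width is not an integer, adjacent bins alternately cover
# different numbers of integer levels, so the histogram shows a spurious
# sawtooth; bins=10 hides detail but avoids that artifact.
```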
When you have data like this (quantized), you basically don't want your bin count to be "close" (in order of magnitude) to your data range, but quite often "auto" will do exactly that, whereas 10 will be close to your data range only, well, if your data range is close to 10 (yes this sounds tautological but I think the idea is reasonable?). Also note that I am actually trying to fix this on numpy's side, starting with the case of bins of width <1 numpy/numpy#12150 but even that seems to be stuck on review.
@jklymak I mostly agree, in that I think that's what people should be doing. So you can say that hiding detail of the data is fine and that users not exploring is a user error. I'm not saying you're wrong, I'm just saying that the consequence is that people miss critical details in their data all the time because of the choice of defaults here. @anntzer I agree this is an ugly artifact. My position is that it is better to have false positives for "huh, that's odd" than false negatives.
@anntzer your patch doesn't fix the issue you demonstrated, right?
It doesn't, but it fixes a similar issue when the auto bin width is <<1, causing many spurious empty bins, which seems even more clearly wrong. If you look at the PR I briefly allude to the fact that for integer data I actually want an integer bin width, but I'm just trying to regularize <1 to 1 (which already seems hard enough to get merged!) before trying more "fancy" stuff (i.e. rounding other bin widths to the nearest integer). My position remains that if you really want it, this change of defaults should go to numpy.
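For illustration, the regularization being described can be sketched as follows (a simplified stand-in, not the actual numpy/numpy#12150 patch):

```python
import numpy as np

def integer_safe_auto_edges(data):
    """Sketch of the fix discussed above: if the 'auto' rule picks a bin
    width below 1 for integer data, clamp it to 1 so no spurious empty
    bins appear between integer values. (Illustrative only; not the real
    numpy implementation.)
    """
    data = np.asarray(data)
    edges = np.histogram_bin_edges(data, bins="auto")
    if np.issubdtype(data.dtype, np.integer) and edges[1] - edges[0] < 1:
        # One unit-width bin centered on each integer value.
        edges = np.arange(data.min() - 0.5, data.max() + 1.0, 1.0)
    return edges

# A small integer spread plus many samples drives the 'auto' width well
# below 1, which would otherwise leave most bins empty:
rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=1_000_000)
print(len(integer_safe_auto_edges(data)) - 1)  # -> 10, one bin per value
```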
There would still need to be a change in matplotlib though, right? The default is “10 bins”, not “defer to numpy’s default”.
Re doing the simple fix first: fair, seems like a good strategy :)
If the change goes to numpy we could reasonably do the change in sync.
Overall, I think matplotlib should avoid data analysis algorithms as much as possible, but we have some holdovers from when we had a larger scope, and hist is certainly one of them.
True, we should add that. The "auto, but clip at N" behavior should probably go to numpy.
This is fair, but the ship has sailed on pulling this one back. If we are going to have it, then we might as well make it nice. I take both @amueller's and @mwaskom's views on what "nice" means for data science / stats / AI-ML folks to be close to authoritative. Those may not be fields most of the core devs work in day-to-day, but many (a majority?) of our users do. There is no default which is "right" for everyone in every domain; I think this is going to come down to a judgement about what default is the least bad for the most people. As a reminder of history,
"np_default" is fine with me; however, I ask once again: why should "auto" as a default go into mpl and not into numpy?
#16471 opened for adding the option. To a large degree, the users this is for do not care where in the stack the change is made, just that it is done somewhere. At this point I am still leaning towards making this change in numpy, but to take the other side, I think it would be fair if numpy argued that they have already exposed the algorithms. Maybe what we want is the ability to inject an alternative to np.histogram.
Tactically, I think we should try not to drift from numpy as much as possible so that we can pass bug reports to them. Strategically, I'm still against black boxes as defaults, for either us or numpy. The fact that there are seven auto-binning algorithms should give us pause: there is not actually a consensus on the best way to automagically choose histogram bins. That shouldn't be a surprise, because choosing histogram bins is subjective, based on your underlying data and the scientific point you are trying to make with the histogram. @tacaswell, your digression above points to all the pitfalls of automatic algorithms. If some field has a standard algorithm that they all learn about in their 101 class, and they know the pitfalls, that's great and they should use that. But I don't think a foundational library should do anything fancy as the default.
That's reasonable, but I think you're slightly over-representing how "fancy" some of the approaches are. E.g., Sturges' rule (R's default) is to use ceil(log2(n)) + 1 bins. Let's also keep in mind that the default choice of 10 bins is completely arbitrary and not really defensible on grounds other than how many fingers we have and status quo bias. I'm generally in favor of matplotlib maintaining the status quo, so I don't actually feel very strongly here, but it's best to keep the discussion on its actual terms: an arbitrary fixed bin count vs. a simple reference rule.
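To the point about how simple these reference rules are, both fit in a couple of lines (formulas as documented by numpy):

```python
import numpy as np

def sturges_bins(n):
    # Sturges' rule: ceil(log2(n)) + 1 bins.
    return int(np.ceil(np.log2(n))) + 1

def fd_binwidth(a):
    # Freedman-Diaconis rule: bin width = 2 * IQR / cbrt(n).
    a = np.asarray(a)
    q75, q25 = np.percentile(a, [75, 25])
    return 2 * (q75 - q25) / np.cbrt(a.size)

print(sturges_bins(1000))  # -> 11
rng = np.random.default_rng(0)
print(f"FD width for 1000 normal samples: {fd_binwidth(rng.normal(size=1000)):.3f}")
```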
If the default is changed to "np_default", that seems good to me.
Opened numpy/numpy#15569 in numpy.
This is revisiting #4487, in which @jakevdp suggested changing the default of bins to 'auto'. Since automatic determination is now supported in matplotlib via numpy, I think it would be great to make it the default.

The main reason for wanting the change is that many people use this for data analysis, and the behavior of bins=10 is pretty terrible in many cases (see Jake's example); still, many people use the defaults. Good defaults matter. I'd love to keep educating people, but no amount of educating will prevent people from using the defaults (we found this to be true in sklearn when mining GitHub).
Many people use this from pandas and the actual implementation is in numpy, and @jklymak makes the case that matplotlib ideally delegates as much to numpy as possible. I am very sympathetic to this position.
My main claim is that somewhere the default should change.
Currently my position is that matplotlib is the best place for that. I don't think having pandas change the default would be as good, since it would lead to inconsistencies between pandas and matplotlib. I would be happy with numpy changing the default, but the use cases of numpy are not necessarily related to visualization or even data analysis at all, so it's less clear to me that 'auto' is a good default there.
Also, from my perspective (and yours might be different), changing the default in numpy is more likely to break people's code and might require code changes, so the case for changing there needs to be really strong, and I think it's weaker than for matplotlib.
If you have good reasons to suggest changing the defaults in numpy, I'm happy for us all to figure this out together (data science user + numpy + matplotlib). But right now, the default behavior leads to people making bad inferences.