Request: change hist bins default to 'auto' #16403

Closed
amueller opened this issue Feb 3, 2020 · 30 comments
Labels
API: changes status: closed as inactive Issues closed by the "Stale" Github Action. Please comment on any you think should still be open. status: inactive Marked by the “Stale” Github Action topic: hist

Comments

@amueller
Contributor

amueller commented Feb 3, 2020

This is revisiting #4487 in which @jakevdp suggested changing the default of bins to 'auto'.
Since automatic determination is now supported in matplotlib via numpy, I think it would be great to make it the default.

The main reason for wanting the change is that many people use this for data analysis, and the behavior of bins=10 is pretty terrible in many cases (see Jake's example); still, many people use the defaults.
Good defaults matter. I'd love to keep educating people but no amount of educating will prevent people from using the defaults (we found this true in sklearn when mining github).

Many people use this from pandas and the actual implementation is in numpy, and @jklymak makes the case that matplotlib ideally delegates as much to numpy as possible. I am very sympathetic to this position.

My main claim is that somewhere the default should change.

Currently my position is that matplotlib is the best place for that. I don't think having pandas change the default would be as good as it would lead to inconsistencies between pandas and matplotlib. I would be happy with numpy changing the default, but the use cases of numpy are not necessarily related to visualization or even data analysis at all, so it's less clear to me that 'auto' is a good default there.

Also, from my perspective (and yours might be different), changing the default in numpy is more likely to break people's code and might require code changes, so the case for changing there needs to be really strong, and I think it's weaker than for matplotlib.

If you have good reasons to suggest changing the defaults in numpy, I'm happy for us all to figure this out together (data science user + numpy + matplotlib). But right now, the default behavior leads to people making bad inferences.

@jklymak jklymak added this to the v3.3.0 milestone Feb 3, 2020
@jklymak
Member

jklymak commented Feb 3, 2020

I'm not a huge fan of black boxes as a default: nbins=10 is pretty easy to explain. Even astropy (linked above) defaults to nbins=10. "Matplotlib will arbitrarily change the number of bins from data set to data set based on characteristics of that data set" is less easy to explain and can lead to just as many interpretation errors, if not more, in my opinion.

OTOH, if we wanted to allow automatic algorithms that the user directly specifies, I'm fine with that so long as those algorithms are well documented. Preferably they would be compatible with whatever is in numpy.histogram.

@amueller
Contributor Author

amueller commented Feb 3, 2020

OTOH, if we wanted to allow automatic algorithms that the user directly specifies, I'm fine with that so long as those algorithms are well documented.

That is already the case, right? You can specify either 'auto' or any of the other algorithms that numpy supports in numpy.histogram.
So you can do it, but you do need to change a default.
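
For reference, what already works today can be checked directly: numpy's estimator names are valid as the `bins` argument because matplotlib forwards them to `np.histogram` / `np.histogram_bin_edges`. A minimal sketch (the sample data here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# Each of these strings is already accepted by plt.hist(data, bins=name),
# since the value is forwarded to np.histogram_bin_edges.
nbins = {name: len(np.histogram_bin_edges(data, bins=name)) - 1
         for name in ["auto", "fd", "sturges", "scott", "rice", "sqrt", "doane"]}
print(nbins)
```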

I think there's a trade-off between "simple to explain" and "works well" and it's often not an obvious one. In this case, I think the black box works so much better that it would be worth it.

FWIW I have to explain to dozens of students why plt.hist does something nonsensical by default. In particular people coming from R to Python are pretty shocked.

@amueller
Contributor Author

amueller commented Feb 3, 2020

And re the astropy docs, please check the original issue. I'm pretty sure Jake (or someone else at astropy) wrote this documentation just because the default value is bad. They are trying to educate people to move away from bad default values.
If the default value would be "auto", they would probably just remove their function and the page I linked to.

[edit: they could have changed their default but didn't, so they might not remove it if the default changes. This whole function might predate the implementation in numpy and I'm not sure what it adds right now, maybe one new method?]

@jklymak
Member

jklymak commented Feb 3, 2020

@amueller
Contributor Author

amueller commented Feb 4, 2020

Sorry I'm not sure what that was replying to.

@jklymak
Member

jklymak commented Feb 4, 2020

You said people coming from R to python are "shocked", so I was curious what R's algorithm is....

@amueller
Contributor Author

amueller commented Feb 4, 2020

Ah. I think the default would be to use ggplot2. Which seems to have... 30? https://ggplot2.tidyverse.org/reference/geom_histogram.html
That is ... curious and potentially supporting your point. I'm quite surprised, I kinda want to ask Hadley now. Seaborn also doesn't seem to touch the default of 10, though I'm not sure if that's for consistency or a deeper reason. Maybe @mwaskom cares to weigh in?

@mwaskom

mwaskom commented Feb 4, 2020

Seaborn uses the Freedman-Diaconis rule by default, which I believe is the default in base R plots (not sure about ggplot).

Edit: actually I guess R uses Sturges' rule by default, and numpy's 'auto' bins is max(FD, Sturges).
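
That max(FD, Sturges) behavior is easy to verify with a quick sketch (arbitrary sample data; 'auto' takes whichever estimator yields the smaller bin width, i.e. the larger bin count):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=5000)

n_fd = len(np.histogram_bin_edges(x, bins="fd")) - 1
n_sturges = len(np.histogram_bin_edges(x, bins="sturges")) - 1
n_auto = len(np.histogram_bin_edges(x, bins="auto")) - 1

# For large n, FD usually wins, so 'auto' follows FD here.
print(n_fd, n_sturges, n_auto)
```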

@tacaswell
Member

I am 👍 to this in principle (hist does so many other magical things, and there is an easy way to get the old behavior back); however:

  • is this a small enough API change to make in a minor release?
  • if no, are there any other default changes people want that we can bundle into a 4.0 later this calendar year?

The first thing to do would be to put bins='auto' in every call to hist in our examples.

@mwaskom

mwaskom commented Feb 4, 2020

One thing to keep in mind: the automatic rules can produce arbitrarily small binwidths. This can pose two problems with large inputs.

  • Plotting all of the bars can get slow (although this seems to be less of a problem than it used to be)
  • If edgecolor != facecolor or rcParams["patch.force_edgecolor"] == True, at a certain point the histogram will appear as a solid area in the edgecolor. This may surprise people, especially in the latter case where hist will appear to just ignore the color param.

For these reasons, seaborn actually uses min(fd_bins, 50) by default. But that makes it less of a principled reference rule...
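
That cap is a one-liner; here is a hypothetical helper along those lines (the name capped_fd_bins is made up for illustration, not seaborn's actual API):

```python
import numpy as np

def capped_fd_bins(x, max_bins=50):
    # Freedman-Diaconis bin count, clipped to at most max_bins.
    n = len(np.histogram_bin_edges(x, bins="fd")) - 1
    return min(n, max_bins)

rng = np.random.default_rng(0)
clean = rng.normal(size=200)       # FD gives a modest count here
spiky = np.append(clean, 1e4)      # one distant outlier inflates the FD count
print(capped_fd_bins(clean), capped_fd_bins(spiky))
```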

@anntzer
Contributor

anntzer commented Feb 4, 2020

I'm fairly strongly against changing the default to "auto".

In fact I actually have hist.bins: auto in my matplotlibrc :-) but this only makes me more aware of the limitations of "auto":

  • As mentioned by @mwaskom, this can lead to arbitrarily small binwidths and veeeeery, veeeeery slow plotting (somehow my datasets seem to do this quite often... typically it's stuff with distant outliers).
  • If the underlying data is discrete, the auto choice can be fairly misleading (ENH: When histogramming data with integer dtype, force bin width >= 1. numpy/numpy#12150) -- 10 is "fine" unless you happen to just have a dozen discrete values, which is less common (whereas "auto" can be bad regardless of the number of discrete values you have).
  • "auto" may or may not be a particularly good choice when histogramming multiple datasets at once (right now we just treat the concatenated datasets as a single one for bin-selection purposes) or weighted datasets (I think we just ignore the weights? not sure, though).
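
The first bullet is easy to reproduce: a single distant outlier stretches the data range while leaving the IQR (and hence the FD bin width) essentially unchanged, so the 'auto' bin count explodes. Illustrative numbers only:

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=10_000)
with_outlier = np.append(base, 1e4)   # one far-away point

n_base = len(np.histogram_bin_edges(base, bins="auto")) - 1
n_out = len(np.histogram_bin_edges(with_outlier, bins="auto")) - 1
print(n_base, n_out)   # the second count is orders of magnitude larger
```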

It may seem hypocritical to argue against making "auto" the default even though that's what I personally use, but that's because I am well aware of the problems here. The arguments above run in the direction of "we want to make auto the default because we don't want to require users to understand the underlying issues" -- but then we don't really want to start dealing with a bunch of bug reports claiming that "auto" gave nonsensical bins when it's really working "as intended" and "auto" is just a bad choice for the dataset at hand.

I also think that the use cases in matplotlib are actually quite close to the ones in numpy (in the sense of, are there really many things that you would pass to np.histogram() that would not make sense passing to plt.hist()?), and thus, if anything, the change should go to numpy, not matplotlib. The deprecation strategy would be the same anyways...

@jklymak
Member

jklymak commented Feb 4, 2020

I checked Matlab, and they now use an auto algorithm in histogram by default. The old hist default was 10 bins which I guess is where our hist came from. It might be worthwhile combing their forum to see how much pain the new default caused.

@amueller
Contributor Author

amueller commented Feb 4, 2020

@anntzer While I agree it's not very principled, the min('auto', 50) cap probably gets rid of the slowness issue.
And it's not that I don't want to educate users about the shortcomings; I know that I will not be able to educate all users. Having 10 bins means that the user is likely unaware when the bins are too wide, because it just looks like there's no structure. If there are too many bins, I think the danger of not being aware that there's a problem is lower.
I'm not entirely sure I understand your main point, though. Basically the 'auto' choice is not always perfect, but I would argue it's basically always better than 10.

unless you happen to just have a dozen discrete values, which is less common

This in fact seems very common in data science applications. Generally, it's probably hard to make a statement like this without defining a distribution over datasets ;)

My problem with 10 is that it looks like it worked but it actually gave a nonsensical result, and so the user has no opportunity to discover their error.

@jklymak
Member

jklymak commented Feb 4, 2020

My problem with 10 is that it looks like it worked but it actually gave a nonsensical result, and so the user has no opportunity to discover their error.

The result may hide detail, but it's not nonsense, nor is it an "error". It's just a coarse way of looking at the data.

If you analyze data, and there is a choice like this, the first thing you should do is explore the consequences of making that choice on your data. Trusting an algorithm to make that choice, be it super-naive like n=10, or something fancy ("The binwidth is proportional to the interquartile range (IQR) and inversely proportional to cube root of a.size."), is not good data analysis practice. For that reason I prefer you start with the simple method and encourage trying the other methods.

@anntzer
Contributor

anntzer commented Feb 4, 2020

Real example of a case where 10 is less misleading than "auto": I have a camera whose pixel values are bell-shape distributed over a range of ~500, but quantized to integer values. The noise is not actually gaussian (a typical case would be an EMCCD camera at the limit of Poissonian photon counts) so I want to plot the distribution of pixel values (say 300x300 ~ 1e5 px) to examine it; here I'll generate normally distributed values for simplicity though.

```python
import matplotlib.pyplot as plt
import numpy as np

# Same quantized (integer) data under both defaults, for a fair comparison.
data = (np.random.randn(100000) * 100).astype(int)
axs = plt.figure().subplots(2)
axs[0].hist(data, bins="auto")
axs[1].hist(data, bins=10)
plt.show()
```

[figure "out": the two histograms, bins='auto' (top, with periodic dips) vs bins=10 (bottom)]
Notice the funny dips in the "auto" histogram? This is because the bin size is ~5.8 which means that 3 bins out of 4 cover 6 integers but the 4th bin only covers 5 integers, which means that the count there is naturally 1/6th less.

When you have data like this (quantized), you basically don't want your bin count to be "close" (in order of magnitude) to your data range, but quite often "auto" will do exactly that, whereas 10 will be close to your data range only, well, if your data range is close to 10 (yes this sounds tautological but I think the idea is reasonable?).
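
The dip arithmetic can be checked directly, independent of the histogram code: with an exact width of 5.8, counting the integers that land in each half-open bin shows the uneven coverage (four of every five bins cover 6 integers, the fifth covers only 5). Using Fraction keeps the arithmetic exact:

```python
from fractions import Fraction
import math

w = Fraction(29, 5)   # bin width 5.8, kept exact to avoid float fuzz
# number of integers i with k*w <= i < (k+1)*w, for the first ten bins
counts = [len(range(math.ceil(k * w), math.ceil((k + 1) * w)))
          for k in range(10)]
print(counts)   # a repeating pattern of 6s with an occasional 5
```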

Also note that I am actually trying to fix this on numpy's side, starting with the case of bins of width <1 numpy/numpy#12150 but even that seems to be stuck on review.

@amueller
Contributor Author

amueller commented Feb 4, 2020

@jklymak I mostly agree, in that I think that's what people should be doing.
Having taught data science to many people, I found there is no way to actually make people do it, though. I don't really like an appeal to authority, but between me and Jake we have probably taught thousands of people data science, not counting our (mostly his) books, and at least to me it was clear that there are issues getting people to actually explore settings.

So you can say that hiding detail of the data is fine and users not exploring is a user error. I'm not saying you're wrong, I'm just saying that the consequence is that people miss details in their data all the time because of the choice of defaults here and miss critical aspects of the data.

@anntzer I agree this is an ugly artifact. My position is that it is better to have false positives for "huh, that's odd" than false negatives.

@amueller
Contributor Author

amueller commented Feb 4, 2020

@anntzer your patch doesn't fix the issue you demonstrated, right?

@anntzer
Contributor

anntzer commented Feb 4, 2020

It doesn't, but it fixes a similar issue when the auto bin width is <<1 causing many spurious empty bins, which seems even more clearly wrong. If you look at the PR I briefly allude to the fact that for integer data I actually want an integer bin width, but I'm just trying to regularize <1 to 1 (which already seems hard enough to get merged!) before trying more "fancy" stuff (i.e. rounding other bin widths to the nearest integer).

My position remains that if you really want it, this change of defaults should go to numpy.

@mwaskom

mwaskom commented Feb 4, 2020

There would still need to be a change in matplotlib though, right? The default is “10 bins” not “defer to numpy’s default”.

@amueller
Contributor Author

amueller commented Feb 4, 2020

Re doing the simple fix first: fair, seems like a good strategy :)

@anntzer
Contributor

anntzer commented Feb 4, 2020

If the change goes to numpy we could reasonably do the change in sync.

@jklymak
Member

jklymak commented Feb 4, 2020

Overall, I think matplotlib should avoid data analysis algorithms as much as possible, but we have some holdovers from when we had a larger scope, and certainly hist is one of those. We have some in the spectral domain, and a few other stats ones (boxplot). Given that, I think that whenever possible methods like hist should pass through data-analysis parameters like nbins to numpy (and/or scipy), where the algorithms are implemented, and let them decide the best defaults, presumably based on community preference.

@tacaswell
Member

There would still need to be a change in matplotlib though, right?

True, we should add 'np_default' as a valid value in the rcparams. It will be a bit annoying (because we can't just pass None through to numpy, we have to not pass bins in at all to get this behavior). This should be done no matter how this discussion lands.
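
A minimal sketch of what that sentinel could look like (the names here are hypothetical, not an actual matplotlib API): since numpy has no explicit "use your default" value for bins, the only way to defer is to omit the argument entirely.

```python
import numpy as np

def _counts(data, bins="np_default"):
    # Hypothetical dispatch: "np_default" means "let numpy decide",
    # which requires not passing bins at all (None is not accepted).
    if bins == "np_default":
        return np.histogram(data)
    return np.histogram(data, bins=bins)

counts, edges = _counts(np.arange(100))
print(len(edges) - 1)   # numpy's own default, currently 10 bins
```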

The "auto, but clip at N" should probably go to numpy.

Overall, I think matplotlib should avoid data analysis algorithms as much as possible,

This is fair, but the ship has sailed on pulling this one back. If we are going to have it, then we might as well make it nice. I take both @amueller's and @mwaskom's views on what "nice" means for data science / stats / AI-ML folks to be close to authoritative. Those may not be fields most of the core devs work in day-to-day, but many (a majority?) of our users do.

There is no default which is "right" for everyone in every domain; I think this is going to come down to a judgement about which default is the least bad for the most people.

As a reminder of history, bins='auto' is in numpy because we got a feature request for it in Matplotlib, had a very similar conversation, and it turned into a numpy PR.

@anntzer
Contributor

anntzer commented Feb 11, 2020

"np_default" is fine with me; however, I state my question once again: why should "auto" as default go into mpl and not into numpy?

@tacaswell
Member

#16471 for adding the 'np_default' value as allowed input to ax.hist.

To a large degree, the users this is for do not care where in the stack the change is made, just that it is done somewhere.

At this point, I am still leaning towards making this change in numpy. But to take the other side: numpy could fairly argue that they have exposed np.histogram_bin_edges as public API, that adding support for passing strings into np.histogram was a mistake, and that if some downstream library (aka us) wants to do some fancy binning selection, that is its problem. Maybe we want to put guard rails on the auto binning: set both a min and a max count? Set a min/max bin width? Let auto pick the best one and then clip? Run a bunch of estimators, pick from those that satisfy the conditions, and then clip? Try to run outlier detection and exclude the extremes from the bin analysis (that one is scary, as we don't currently do anything with the over/under values)? Once you start going down that road, the complexity has no obvious stopping point.

Maybe what we want is the ability to inject an alternative to np.histogram_bin_edges into ax.hist ?
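
In practice, injection already falls out of the fact that Axes.hist accepts an explicit sequence of edges: compute them with any function you like, then pass them in. A sketch with a made-up binner for quantized data (the name integer_edges is hypothetical):

```python
import numpy as np

def integer_edges(a):
    # Hypothetical replacement for np.histogram_bin_edges:
    # integer-aligned unit-width bins, suited to quantized data.
    return np.arange(np.floor(a.min()), np.ceil(a.max()) + 2)

data = np.array([0, 1, 1, 2, 5, 5, 5])
edges = integer_edges(data)
counts, _ = np.histogram(data, bins=edges)
# ax.hist(data, bins=edges) would draw exactly these bins
print(counts)
```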

@jklymak
Member

jklymak commented Feb 11, 2020

Tactically, I think we should try not to drift from numpy as much as possible, so that we can pass bug reports to them.

Strategically, I'm still against black boxes as defaults for either us or numpy.

The fact that there are seven auto-binning algorithms should give us pause: there is not actually a consensus on the best way to automagically choose histogram bins. That shouldn't be a surprise, because choosing histogram bins is subjective, based on your underlying data and the scientific point you are trying to make with the histogram. @tacaswell, your digression above points to all the pitfalls of automatic algorithms. If some field has a standard algorithm that they all learn about in their 101 class, and they know the pitfalls, that's great and they should use it. But I don't think a foundational library should do anything fancy as the default.

@mwaskom

mwaskom commented Feb 11, 2020

I don't think a foundational library should do anything fancy as the default.

That's reasonable, but I think you're slightly over-representing how "fancy" some of the approaches are. E.g., Sturges' rule (R's default) is to use 1 + 3.3 log10(n) bins, which is easy to express and comprehend. There are more baroque computational approaches, like cross-validation, but those aren't under consideration here.
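
The arithmetic is simple enough to do inline; a quick check of the two equivalent formulations (1 + log2(n) and 1 + 3.322 log10(n)):

```python
import math

n = 1000
# Sturges' rule: k = 1 + log2(n) bins, rounded up.
k = math.ceil(1 + math.log2(n))
print(k)   # 11 bins for n = 1000
```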

Let's also keep in mind that the default choice of 10 bins is completely arbitrary and not really defensible on grounds other than how many fingers we have and status quo bias.

I'm generally in favor of matplotlib maintaining status quo, so I don't actually feel very strongly here, but best to keep the discussion on its actual terms: an arbitrary fixed bin size vs. a simple reference rule.

@amueller
Contributor Author

If the default is changed to "np_default", that seems good to me.
While I think that having failsafes would be nice, I see the concern with having additional logic in matplotlib.

@amueller
Contributor Author

Opened numpy/numpy#15569 in numpy.

@QuLogic QuLogic modified the milestones: v3.3.0, v3.4.0 May 23, 2020
@QuLogic QuLogic modified the milestones: v3.4.0, v3.5.0 Jan 27, 2021
@QuLogic QuLogic modified the milestones: v3.5.0, v3.6.0 Sep 25, 2021
@QuLogic QuLogic modified the milestones: v3.6.0, unassigned Jul 8, 2022
@story645 story645 modified the milestones: unassigned, needs sorting Oct 6, 2022

github-actions bot commented Nov 3, 2023

This issue has been marked "inactive" because it has been 365 days since the last comment. If this issue is still present in recent Matplotlib releases, or the feature request is still wanted, please leave a comment and this label will be removed. If there are no updates in another 30 days, this issue will be automatically closed, but you are free to re-open or create a new issue if needed. We value issue reports, and this procedure is meant to help us resurface and prioritize issues that have not been addressed yet, not make them disappear. Thanks for your help!

@github-actions github-actions bot added the status: inactive Marked by the “Stale” Github Action label Nov 3, 2023
@github-actions github-actions bot added the status: closed as inactive Issues closed by the "Stale" Github Action. Please comment on any you think should still be open. label Dec 4, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 4, 2023

8 participants