plt.hist density argument does not function as described. #10398
The histogram works as expected. The density argument is explained in the documentation: the counts are normalized such that the area under the histogram integrates to one. Hence if you calculate the integral, i.e. the sum of bar height times bar width, the result is indeed 1.0, as expected. You might want to search stackoverflow.com for questions where people got confused by this.
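For concreteness, a minimal sketch of that integral check (the data here is made up, not taken from the original report):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(size=1000)  # hypothetical sample data

# With density=True the returned heights are counts / (n_samples * bin_width) ...
heights, bins, _ = plt.hist(data, bins=30, density=True)

# ... so the Riemann sum of (height * bin width) over all bars is 1.0,
# even though the heights themselves generally do not sum to 1.
print(np.sum(heights * np.diff(bins)))  # ~1.0
```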
@ryanpeach Which documentation did you find misleading?
I guess I don't understand why np.diff(bins) is ever relevant for the summation. To me, "the counts normalized to form a pdf, ie, the area under the histogram will sum to one" literally means "[the sum of] the first element of the tuple will sum to 1", since the number of bins in a histogram is more like a resolution for the integral than it is an x value. I don't understand how to generate the desired functionality. Maybe there should be an option for this if it's such a common misunderstanding.
I guess what I'm saying is, a pdf doesn't include the x value in integrating to find the normalization constant. It's the y values that must sum to 1.
The area of each bar is (height * width), so "the counts normalized to form a pdf, ie, the area (or integral) under the histogram will sum to one" translates to: the sum of (height * width) over all bars is one. We changed the kwarg from 'normed', which could reasonably be either one, to 'density' to avoid this ambiguity. If you want your bars to sum to one (independent of the bin widths), then pass in appropriate weights.
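The weights approach referred to above is commonly written along these lines (a sketch, not the exact code from the comment):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.normal(size=500)  # hypothetical sample data

# Each sample contributes 1/N to its bin, so the bar *heights* sum to one
# regardless of the bin widths (a "fraction"/PMF-style normalization).
heights, bins, _ = plt.hist(data, bins=25, weights=np.ones_like(data) / len(data))

print(heights.sum())  # ~1.0 (samples outside the bin range are not counted)
```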
@ryanpeach How would you re-write that docstring to be clearer to you?
The numpy.histogram documentation is a bit more verbose on this part:
It also directly provides an example case which makes this clear. Maybe the matplotlib documentation can link to the numpy docs or take this over? The comment above is also a good one and could be added to the docs. Alternatively, in order not to blow up the docstrings, what about a dedicated histogram example? It seems there is currently only the pyplot-text example around showing how to use the hist function.
@ryanpeach "a pdf doesn't include the x value in integrating to find the normalization constant" is not correct; any integral takes the independent variable into account, i.e. in the integral ∫ f(x) dx the dx (here, the bin width) is part of the calculation.
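To make both points concrete, here is an example in the spirit of the one in the numpy.histogram documentation (an illustration, not a verbatim copy):

```python
import numpy as np

# density=True divides the counts by (total count * bin width), so the
# returned values form a probability density: the dx (bin width) is what
# makes the integral, not the plain sum, equal to one.
counts, bins = np.histogram(np.arange(4), bins=np.arange(5), density=True)
print(counts)                          # [0.25 0.25 0.25 0.25]
print(np.sum(counts * np.diff(bins)))  # 1.0
```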
Ok, so I just learned the distinction between a pdf and a pmf. Why not just add normed as a flag and make it the pmf? This is a very common feature, and shouldn’t require hacks to perform.
I guess it will be hard to add another functionality to an argument which has been there for years, even though it is deprecated by now; a new argument would be needed instead. However, one problem would be that the histogram can take unequal bin sizes; in this case normalizing to the sum of the counts is rather dangerous. Any such plot would essentially need a label like "Attention, this plot has been normalized to the sum of the frequencies independent of the bin width". Hence I'm not sure that adding this normalization won't lead to even more confusion. In the end, why not stick to the above suggestion of passing in weights?
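A rough sketch of the unequal-bin caveat, with invented data:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.uniform(0, 10, size=10000)  # flat distribution on [0, 10]

# Deliberately unequal bins: one wide bin followed by narrow ones.
bins = np.array([0.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
counts, _ = np.histogram(data, bins=bins)

# Normalizing to the sum of the counts: the wide bin towers over the
# narrow ones even though the underlying density is flat.
print(counts / counts.sum())

# Density normalization: all heights are comparable (~0.1 here).
print(counts / (counts.sum() * np.diff(bins)))
```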
For a PMF perhaps you should use a different calculation and plotting method entirely: instead of bins, find the counts of values for all unique (discrete) values in the array, and plot each count at each discrete value, either with a marker or with a vertical line up from the origin.
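A minimal sketch of that suggestion, using made-up discrete data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
rolls = rng.integers(1, 7, size=200)  # hypothetical die rolls

# Count each unique discrete value and normalize to a probability mass function.
values, counts = np.unique(rolls, return_counts=True)
pmf = counts / counts.sum()

# One stem per discrete value; no bins or bin widths are involved.
plt.stem(values, pmf)
plt.xlabel("value")
plt.ylabel("probability")
plt.show()
```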
Eric, that’s a good idea, except for when you are dealing with semi-continuous things like integers, or continuous values that you still want divided into bins. Like, if you have 100 samples of a continuous distribution, your distribution will not look continuous without bins.
Make those two operations mutually exclusive (illegal to use unequal bin sizes with the pmf option).
Call it "mass" for consistency.
Because I can hardly think of a time when a pdf would be useful in the context of a histogram, whereas pmfs are very natural use cases for them, yet we have a flag for the pdf and not one for the pmf. On top of that, weighting the values individually seems unintuitive, though I can see why it's correct. It just seems like something the average user shouldn't have to hack around. I'll add it if that's what it takes.
This conversation has gotten more contentious than necessary.
This is pushing on a tension we have throughout Matplotlib between methods with simple signatures that do one thing (and try to do it well) and methods with very complex signatures (and internals) that do many different things. Having kwargs that are mutually exclusive or have complex conditional dependencies between them quickly becomes hard to document and maintain (we already have things like this, so this is not a theoretical concern). The signature and implementation of hist are already on the complex side.
Matplotlib has a wide enough user base that we have to be careful about generalization and appeals to 'intuitive'. From my background (physics), the density normalization seems 'natural' and normalizing by mass seems like a corner case where I am fine using the 'weights' method (but definitely get that my experience is not universal).
As I said above, I am wary of adding complexity to hist. It may also be worth taking this question to numpy (a year or so ago, we had a similar request for optimal bin estimation in hist).
How is this different from passing in the weights?
I don't see any problem with unequal bin sizes; the point of the PMF is that the value for each bin is the probability (or more correctly, the relative frequency) of values landing in that bin; there is no reason the bins have to be the same size.

The big difference between the PMF and the PDF is the way they change with the bin boundaries, given a continuous rather than discrete RV. So long as the bins cover the full range of values in both cases, the basic shape and level of the PDF is independent of the number of bins; it just gets noisier as the number of bins increases. For a PMF, in contrast, the general level is proportional to the bin width. If the bins are not the same size, there is nothing wrong with this, but it has to be kept in mind when interpreting the plot. In the case of a PMF bar plot, it means the visual mass of the bar is not a good indicator of frequency.

Perhaps in an ideal world numpy would have provided a "normalization" kwarg with values of "counts", "pdf", and "pmf" (or "counts", "density", and "fraction"), and we would just propagate that into our hist.
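For illustration only, a hypothetical helper along the lines of such a kwarg could look like this (`normalized_hist` is not an existing numpy or matplotlib function):

```python
import numpy as np

def normalized_hist(data, bins, mode="counts"):
    """Sketch of the three conventions discussed above: 'counts' (raw),
    'pdf'/'density' (area sums to one), 'pmf'/'fraction' (heights sum to one)."""
    counts, edges = np.histogram(data, bins=bins)
    widths = np.diff(edges)
    if mode in ("pdf", "density"):
        heights = counts / (counts.sum() * widths)  # integral over x is 1
    elif mode in ("pmf", "fraction"):
        heights = counts / counts.sum()             # sum of heights is 1
    else:
        heights = counts                            # raw counts
    return heights, edges
```

The returned heights could then be drawn with a bar plot, e.g. plt.bar(edges[:-1], heights, width=np.diff(edges), align="edge").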
I wrote a longer response, but I'll just say that a quick look on Google indicates that PMFs seem to be usually plotted as stem plots, as @efiring suggested above. The definition of a histogram (from Wikipedia) is that it represents a finite approximation of a PDF, and that's how I've always understood it. I'd need to see some evidence that there is a large (published) community out there who want the PMF to be plotted as a histogram before I'd support adding a new kwarg.
So to summarize:

Documentation: a user found the normalization (density) argument confusing. Judging from frequent questions about the topic on stackoverflow, lots of users are similarly confused. I guess the reasons range from having no math background at all, through the word "histogram" being frequently used for just any barplot, to not remembering the definition of a pdf.

Extended functionality of hist: a plot where the bar heights sum to one can be useful to directly grasp the binned probability. E.g. we can directly read off that 37% of all products cost between 100 and 140 dollars. This plot can easily be produced by passing appropriate weights (see the sketch below).
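A hedged sketch of how a plot like that can be produced with weights (the price data is invented; the figure from the original comment is not reproduced here):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

rng = np.random.default_rng(4)
prices = rng.normal(150, 40, size=1000)  # hypothetical product prices

# Weighting every sample by 1/N makes each bar show the fraction of
# products falling into that price range.
plt.hist(prices, bins=np.arange(0, 321, 40),
         weights=np.ones_like(prices) / len(prices))
plt.gca().yaxis.set_major_formatter(PercentFormatter(xmax=1))
plt.xlabel("price in dollars")
plt.ylabel("share of products")
plt.show()
```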
I think it would be useful to provide visual examples such as that for each of the modes we are discussing. I'm still pretty confused as to any true use case of the density argument for the histogram function as-is. And a stem plot? Please provide an example of using that to get a pmf. The use case @ImportanceOfBeingErnest has documented is almost certainly the normal user's intention for using a histogram, and intuitively a histogram would be the first thing such a user would try in order to get that information visually. And even if we disagree on that subjective polling, we can't disagree that it's both valid and popular as a feature request, and that the provided solution at the very least requires a search to discover. Not only that, but it's easy to implement and consistent with existing implemented kwargs.
That's completely unhelpful.
Sure, it shows a stem plot for a pmf, but nothing about a matplotlib implementation.
And why that shouldn't just be a matplotlib histogram with the bar widths reduced to zero and a mass kwarg set to true is beyond me.
Because we have stem for that.
I suppose. But what you're asking users to do is to use np.histogram, normalize themselves, and then do this: https://matplotlib.org/devdocs/gallery/lines_bars_and_markers/stem_plot.html#sphx-glr-gallery-lines-bars-and-markers-stem-plot-py Whereas someone coming from an office environment is going to see plt.hist, think about how they might do it in Excel, and expect it to just work.
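For reference, the workflow being described (bin with np.histogram, normalize by hand, then make a stem plot) looks roughly like this:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
data = rng.normal(size=300)  # hypothetical sample data

# Bin with numpy, normalize the counts by hand, then stem-plot at bin centers.
counts, edges = np.histogram(data, bins=20)
pmf = counts / counts.sum()
centers = 0.5 * (edges[:-1] + edges[1:])

plt.stem(centers, pmf)
plt.show()
```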
You may argue that a stem is just a bar with zero width. However, technically, they are different objects in matplotlib (rectangle vs. line). Thus one has to use the appropriate function.
It might be feasible to add a stem-style histtype to hist.
Then I would argue that there needs to be something that is the stem version of plt.hist... Like, why are we wrapping np.histogram if not to make this easy?
I think the stem is appropriate for the specific case that I mentioned much earlier, when there would be a stem for each unique discrete value. It is not appropriate for the case where there are bins holding values on a continuum, or more than a single integer, because it doesn't show the bin boundaries.
Ok, under histtype we have loads of options for what kind of plot to use. Let users decide: include 'stem' in that list, and make it automatic if mass is used and histtype is left at its default.
True, I agree we need to show the bin boundaries for continuous plots. See my suggestion above. Maybe don't make it automatic.
Here's another idea, gleaned from this Wikipedia article: https://en.m.wikipedia.org/wiki/Probability_density_function The top figure is scaled on the x axis to multiples of sigma. What if we used this math instead (or included the option for it)?
That basically cancels out and gets rid of our problem of different bin sizes (maybe?). This would give us scale invariance (which is really the problem with the PDFs as they are now).
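One way to read that idea, under my own assumptions, is to standardize the data to units of its standard deviation before plotting the density histogram; the x axis then carries no scale of its own:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
data = rng.normal(50, 12, size=1000)  # hypothetical data with arbitrary scale

# Express the data as deviations from the mean in multiples of sigma; the
# resulting density histogram no longer depends on the original units.
z = (data - data.mean()) / data.std()
plt.hist(z, bins=30, density=True)
plt.xlabel("deviation from the mean (multiples of sigma)")
plt.show()
```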
The issue about the pdf and the pmf is slightly pedantic anyway, because there is no such thing as a continuous function in numpy/matplotlib. We are always dealing with discrete floating point numbers, so whatever we plot is a discrete distribution either way.
No, this is not a matter of being pedantic. There is a big difference between looking at frequencies of discrete events (e.g., counts of Fords, Hondas, etc. going through a tollbooth in a day) and quantities that are binned such that there are several or many different values being grouped together in each of several bins. The fact that there is a finite number of floating point numbers that can be represented in a computer with a single or double-precision float is irrelevant.
But these aren't mutually exclusive. For instance, my example and @ImportanceOfBeingErnest's example are binned, continuous, and thus demand a bar plot. However, they do not require normalization by the absolute area under the curve, but rather a normalization in which the bar heights themselves sum to one. A stem chart won't work for them, because they aren't a selection of a few discrete things, yet the density option won't work for them either, because it cares about the absolute area under the curve. Is that a pdf or a pmf? You tell me.
I think this inevitably comes down, on a technical level, to an argument over limiting choice within a function that was designed for convenience. All of this could be done by hand: plt.hist could be removed and everyone could just use bar plots or stem plots with np.histogram until the end of time. But it's obvious that someone thought including plt.hist was important for convenience, yet didn't take into account all the popular ways of using it. Now we are stuck looking at Wikipedia definitions of mathematical terms that have at best a passing relation to the data visualization technique of a histogram and the normalization constants commonly associated with one, and are arguing over how a user "should" do normalization, how they "should" visualize probabilities and frequencies, how they "should" bin things. Call those normalization options whatever you want, make pretty and "canonical" defaults, but not having them because they don't fit a mold (playing favorites), even though plenty of people use them every day, is what is pedantic.
There is a good principle when it comes to plotting: keep data aggregation and manipulation separate from data visualization. In view of this, I am currently not sure if there is actually anything matplotlib should do, other than the two points mentioned above: (1) improve documentation, (2) decide upon adding an additional frequency normalization argument.
Ok, just let me know. I just hope you don't choose to ignore (2) while maintaining the density kwarg; that's just biased. I'll let the thread linger till you all make a decision, but still offer my services in making the pull request.
FWIW, Numpy has a similar issue thread: numpy/numpy#9921 (but it got significantly less traction, to say the least).
I strongly believe ...
The two definitions are the same if you have equally spaced bins (which is the most usual situation). I'd agree that if numpy accepts mass we can revisit this and follow their lead. @ryanpeach if you want to continue the discussion, you should chime in on the numpy issue linked above, or of course another dev is free to re-open this if they disagree...
A more or less recent decision was to keep hist a thin wrapper around np.histogram. FWIW, after a bit of googling, it seems like some (though not most of those that I checked) plotting packages offer the feature discussed here through a keyword argument like "norm"/"density"/etc. that can take string inputs such as "pdf", "cdf", "counts", "probability" (usually what is discussed here), "sf", etc. See for example https://www.mathworks.com/help/matlab/ref/histogram.html Edit: burnt ^^.
Just posted on the numpy thread. I don't think it's numpy's job to do this, because normalization is easy to do mathematically post-hoc. It's just the graphing that's hard. It's matplotlib's wrapper that makes it difficult to do intermediate steps on the binning like normalization. But I digress.
@nunocalaim Thanks for reaching out. However, I do not think your issue is in any way related. Would you mind opening a new issue and including a self-contained example (i.e. code one can copy, paste and run) to allow people to reproduce your problem? This is the only way one can find out if the outcome is expected or whether there is a problem in the matplotlib code base.
Note that we now have ... I'm pretty concerned with Matplotlib mediating multiple normalization conventions among fields. But if someone can come up with a clean way to accommodate everyone, or write their own wrapper, I imagine we could consider that.
Bug report
Bug summary
plt.hist density argument does not function as described.
Code for reproduction
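The reporter's original reproduction code is not preserved in this copy of the issue; a minimal sketch of the kind of call that triggers the confusion (assumed, not the reporter's exact code) is:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(size=1000)

# Expectation in the report: the bar heights sum to 1 with density=True.
# Actual behaviour: the bar *areas* (height * bin width) sum to 1.
heights, bins, _ = plt.hist(data, bins=30, density=True)
print(heights.sum())                    # generally != 1
print((heights * np.diff(bins)).sum())  # == 1
```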
Actual outcome
Expected outcome
Matplotlib version