plt.hist density argument does not function as described. #10398

Closed
ryanpeach opened this issue Feb 8, 2018 · 45 comments
Labels
status: needs clarification Issues that need more information to resolve.

Comments

@ryanpeach

Bug report

Bug summary

plt.hist density argument does not function as described.

Code for reproduction

import matplotlib.pyplot as plt
import numpy as np
x = np.asarray([0,0,0,0,1,1,2,2,2,2,3,3,3,3,3,3,3], dtype="float")
n, bins, _ = plt.hist(x, density=1, bins=2)
print(n)
print(np.sum(n))

Actual outcome

[0.23529412 0.43137255]
0.6666666666666666

Expected outcome

import matplotlib.pyplot as plt
import numpy as np
x = np.asarray([0,0,0,0,1,1,2,2,2,2,3,3,3,3,3,3,3], dtype="float")
n, bins, _ = plt.hist(x, density=1, bins=2)
n /= np.sum(n)  # As advertised
print(n)
print(np.sum(n))
[0.35294118 0.64705882]
1.0

Matplotlib version

  • Operating system: OSX High Sierra
  • Matplotlib version: 2.1.1
  • Matplotlib backend: MacOSX
  • Python version: 3.6.4
  • Other libraries: numpy 1.14.0
  • Installed via: pip3
@ImportanceOfBeingErnest
Member

The histogram works as expected. The density argument is explained in the documentation

density : boolean, optional
If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., the area (or integral) under the histogram will sum to 1. [...]

Hence if you calculate the integral,

print(np.sum(n * np.diff(bins)))

the result is indeed 1.0, as expected.
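
A quick check of this with the data from the report, as a minimal sketch. It calls np.histogram directly instead of plt.hist, on the assumption that both return the same heights for a single, non-stacked dataset:

```python
import numpy as np

# Data and bin count taken from the report above.
x = np.asarray([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3], dtype="float")

# density=True returns the heights of a probability density, not probabilities.
n, bins = np.histogram(x, bins=2, density=True)
widths = np.diff(bins)        # both bins are 1.5 wide here

height_sum = np.sum(n)        # sum of the bar heights: about 0.667
area = np.sum(n * widths)     # sum of the bar areas: 1.0 up to rounding

print(n, height_sum, area)
```

The heights only sum to 1 when every bin has width 1; the areas sum to 1 for any binning.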

You might want to search stackoverflow.com on questions where people got confused by the density or normed keyword argument.

@tacaswell
Member

@ryanpeach Which documentation did you find misleading?

@dstansby dstansby added the status: needs clarification Issues that need more information to resolve. label Feb 12, 2018
@ryanpeach
Author

ryanpeach commented Feb 15, 2018

I guess I don’t understand why np.diff(bins) is ever relevant for the summation. “The counts normalized to form a pdf, i.e., the area under the histogram will sum to one” literally means “[the sum of] the first element of the tuple will sum to 1”, as the number of bins in a histogram is more like a resolution for the integral than it is an x value.

I don’t understand how to generate the desired functionality. Maybe there should be an option for this if it’s such a common misunderstanding.

@ryanpeach
Author

ryanpeach commented Feb 15, 2018

I guess what I’m saying is, a pdf doesn’t include the xvalue in integrating to find the normalization constant. It’s the y value that must sum to 1.

@tacaswell
Member

tacaswell commented Feb 15, 2018

The area of each bar is (height * width), “the counts normalized to form a pdf, ie, the area (or integral) under the histogram will sum to one” translate to np.sum(n*np.diff(bins)) == 1 (not np.sum(n) == 1).

We changed the kwarg from 'normed', which could reasonably be either one, to 'density' to avoid this ambiguity.

If you want your bars to sum to one (independent of the bin widths) then pass in weights=np.ones(len(data)) / len(data) and density=False.
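
As a sketch of that suggestion with the data from the report (np.histogram is used here so the numbers can be checked without drawing anything; plt.hist accepts the same bins and weights arguments):

```python
import numpy as np

data = np.asarray([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3], dtype="float")

# Each sample contributes 1/N, so every bar is the fraction of samples in its bin.
weights = np.ones(len(data)) / len(data)

# density is left at its default (off); only the weights do the normalization.
fractions, bins = np.histogram(data, bins=2, weights=weights)

print(fractions)         # close to [0.353, 0.647]
print(fractions.sum())   # 1.0 up to rounding
```

This reproduces the "expected outcome" from the report without dividing by the bin widths.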

@tacaswell
Member

@ryanpeach How would you re-write that docstring to be clearer to you?

@ImportanceOfBeingErnest
Member

The numpy.histogram documentation is a bit more verbose on this part:

density : bool, optional

If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.

It also directly provides an example case which makes this case clear:

image

Maybe the matplotlib documentation can link to the numpy docs or take this over?

The comment above

If you want your bars to sum to one (independent of the bin widths) then pass in weights=np.ones(len(data)) / len(data) and density=False.

is also a good one which could be added to the docs.

Alternatively, in order not to blow up the docstrings, what about a dedicated histogram example? It seems there is currently only the pyplot-text example around showing how to use the hist function?


@ryanpeach "a pdf doesn’t include the xvalue in integrating to find the normalization constant" is not correct; any integral takes the independent variable into account, i.e. in ∫ f(x) dx you have dx, which is exactly the bin width.

@ryanpeach
Author

ryanpeach commented Feb 17, 2018

Ok, so I just learned the distinction between a pdf and a pmf. Why not just add normed as a flag and make it the pmf? This is a very common feature, and shouldn’t require hacks to perform.

diff = np.diff(bin_edge) if not normed else 1.

@ImportanceOfBeingErnest
Member

I guess it will be hard to add another functionality to an argument which has been there for years, even though it is deprecated by now. So instead of normed one could surely think of another name.

However, one problem would be that the histogram can take unequal bin sizes; in this case normalizing to the sum of the counts is rather dangerous. Any such plot would essentially need a label "Attention, this plot has been normalized to the sum of the frequencies independent of the bin width" or so. Hence I'm not sure if adding this normalization will not lead to even more confusion.

At the end why not stick to the above suggestion of

weights=np.ones(len(data)) / len(data), density=False
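
To illustrate the concern about unequal bin sizes, here is a small sketch with made-up, evenly spread data (the values and bin edges are assumptions chosen for the example, not taken from the thread):

```python
import numpy as np

# Hypothetical data: 400 points spread evenly over (0, 4).
data = np.linspace(0.005, 3.995, 400)

# Unequal bins: two of width 1 and one of width 2.
edges = np.array([0.0, 1.0, 2.0, 4.0])

weights = np.ones(len(data)) / len(data)
fractions, _ = np.histogram(data, bins=edges, weights=weights)
density, _ = np.histogram(data, bins=edges, density=True)

print(fractions)  # the wide bin holds twice the share of the samples
print(density)    # the density is flat, matching the evenly spread data
```

With the fraction normalization the last bar is twice as tall only because its bin is twice as wide, which is the kind of misreading the comment warns about.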

@efiring
Member

efiring commented Feb 18, 2018

For a PMF perhaps you should use a different calculation and plotting method entirely: instead of bins, find the counts of values for all unique (discrete) values in the array, and plot each count at each discrete value, either with a marker or with a vertical line up from the origin.
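
A minimal sketch of this approach, using the data from the report. The helper name plot_pmf_stem is hypothetical, and Matplotlib is imported inside it so the counting part runs on its own:

```python
import numpy as np

x = np.asarray([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3], dtype="float")

# One entry per unique value, with no binning involved.
values, counts = np.unique(x, return_counts=True)
pmf = counts / counts.sum()   # relative frequency of each discrete value


def plot_pmf_stem(values, pmf):
    """Hypothetical helper: draw one vertical stem per discrete value."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.stem(values, pmf)
    ax.set_xlabel("value")
    ax.set_ylabel("relative frequency")
    return fig, ax


print(values, pmf)
```

Calling plot_pmf_stem(values, pmf) followed by plt.show() would draw the stems.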

@ryanpeach
Author

ryanpeach commented Feb 18, 2018 via email

@ryanpeach
Author

However, one problem would be that the histogram can take unequal bin sizes

Make those two operations mutually exclusive (illegal to use unequal bin sizes during pmf)

what to call it

Call it “mass” for consistency.

@ryanpeach
Author

why not stick to the above suggestion

Because I can’t hardly think of a time when a pdf would be useful in the context of a histogram, whereas pmfs are very natural use cases for them, yet we have a flag for the pdf and not one for the pmf.

On top of that, weighting the values individually seems unintuitive, though I can see why it’s correct. It just seems like something the average user shouldn’t have to hack around. I’ll add it if that’s what it takes.

@tacaswell
Member

This conversation has gotten more contentious than necessary.

Make those two operations mutually exclusive (illegal to use unequal bin sizes during pmf)

This is pushing on a tension we have throughout Matplotlib between methods with simple signatures that do one thing (and try to do it well) and methods with very complex signatures (and internals) that do many different things. Having kwargs that are mutually exclusive or have complex conditional dependencies between them quickly becomes hard to document and maintain (we already have things like this so this is not a theoretical concern). The signature and implementation of hist are already pretty complex; I am extremely wary of adding yet more.

Because I can’t hardly think of a time when a pdf would be useful in the context of a histogram, whereas pmfs are very natural use cases for them, yet we have a flag for the pdf and not one for the pmf.

On top of that, weighting the values individually seems unintuitive, though I can see why it’s correct. It just seems like something the average user shouldn’t have to hack around.

Matplotlib has a wide enough user base that we have to be careful about generalization and appeals to 'intuitive'. From my background (physics), the density normalization seems 'natural' and normalizing by mass seems like a corner case where I am fine using the 'weights' method (but definitely get that my experience is not universal).

I’ll add it if that’s what it takes.

As I said above, I am wary of adding complexity to hist. If you do want to add a mass kwarg, how will it interact with density, normed (which is still there, we did not set a deprecation version), and 'stacked' histograms?

It may also be worth taking this question to numpy (a year or so ago, we had a request for optimal bin estimation in hist which got pushed to numpy and now we have bins='auto' on np.histogram). Currently (I think) in the case of non-stacked histograms we just forward all of the computation down to numpy.

@ImportanceOfBeingErnest

However, one problem would be that the histogram can take unequal bin sizes; in this case normalizing to the sum of the counts is rather dangerous.

How is this different than passing in the weights?

@efiring
Member

efiring commented Feb 19, 2018

I don't see any problem with unequal bin sizes; the point of the PMF is that the value for each bin is the probability (or more correctly, the relative frequency) of values landing in that bin; there is no reason the bins have to be the same size.

The big difference between the PMF and the PDF is the way they change with the bin boundaries, given a continuous rather than discrete RV. So long as the bins cover the full range of values in both cases, the basic shape and level of the PDF is independent of the number of bins; it just gets noisier as the number of bins increases. For a PMF, in contrast, the general level is proportional to the bin width. If the bins are not the same size, there is nothing wrong with this--but it has to be kept in mind when interpreting the plot. In the case of a PMF bar plot, it means the visual mass of the bar is not a good indicator of frequency.
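
The dependence on bin width can be seen in a small numerical sketch. The sample (standard normal draws with a fixed seed), the range, and the helper name peak_levels are assumptions made for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=200_000)   # a continuous random variable


def peak_levels(samples, n_bins):
    """Return the tallest bar under density and under fraction normalization."""
    counts, edges = np.histogram(samples, bins=n_bins, range=(-4.0, 4.0))
    fractions = counts / counts.sum()        # PMF-style: share of samples per bin
    density = fractions / np.diff(edges)     # PDF-style: share per unit of x
    return density.max(), fractions.max()


for n_bins in (20, 80):
    d_peak, f_peak = peak_levels(samples, n_bins)
    print(n_bins, d_peak, f_peak)
```

Going from 20 to 80 bins leaves the density peak at roughly the same level, while the fraction peak shrinks roughly in proportion to the bin width.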

Perhaps in an ideal world numpy would have provided a "normalization" kwarg with values of "counts", "pdf", and "pmf", (or "counts", "density", and "fraction") and we would just propagate that into our hist.

@jklymak
Member

jklymak commented Feb 19, 2018

I wrote a longer response, but I'll just say that a quick look on Google indicates that PMFs seem to be usually plotted as stem plots, as @efiring suggested above. The definition of a histogram (from Wikipedia) is that it represents a finite approximation of a PDF, and that's how I've always understood it. I'd need to see some evidence that there is a large (published) community out there who want the PMF to be plotted as a histogram before I'd support adding a new kwarg.

@ImportanceOfBeingErnest
Member

So to summarize:

Documentation: A user found the normalization (density) argument confusing. Judging from frequent questions about the topic on stackoverflow, lots of users are similarly confused. I guess reasons range from them having no math background at all, over the word "histogram" being frequently used for just any barplot, to not remembering the definition of a pdf.
First thing would hence be to update the documentation. There are some proposals above and one may agree on any of them.

Extended functionality of hist: I guess we cannot judge every possible use case of the hist function. Independent of whether it makes sense to use it in the case brought up here, it is probably clear that people want to use this to produce plots of the relative frequency of binned values. Just to give an example:

image

Such a plot can be useful to directly grasp the binned probability. E.g. we can directly read that 37% of all products cost between 100 and 140 dollars. This plot can easily be produced with weights=np.ones_like(data)*100./len(data). The question is hence simply whether to let people use this clumsy argument, or whether to add some new argument making it easier for people to create such plots; e.g. plotly does have such an argument, histnorm ("" | "percent" | "probability" | "density" | "probability density").

@ryanpeach
Author

ryanpeach commented Feb 20, 2018

I think it would be useful to provide visual examples such as that for each of the modes we are discussing. I’m still pretty confused as to any true use case of the density argument for the histogram function as-is. And a stem-leaf plot? Please provide an example of using that to get a pmf. The use case @ImportanceOfBeingErnest has documented is almost certainly the normal user’s intention for using a histogram, and intuitively a histogram would be the first thing such a user would try to get such information visually. And even if we disagree on that subjective polling, we can’t disagree that it’s both valid and popular as a feature request, and that the provided solution at the very least requires a search to discover. Not only that, but it’s easy to implement and consistent with existing implemented kwargs.

@jklymak
Member

jklymak commented Feb 20, 2018

@ryanpeach
Author

That’s completely unhelpful.

@ryanpeach
Author

Sure it shows a stem plot for a pmf, but nothing about a matplotlib implementation.

@ryanpeach
Author

And why that shouldn’t just be a matplotlib histogram with the bar widths reduced to zero and a mass kwarg set to true is beyond me.

@jklymak
Member

jklymak commented Feb 20, 2018

Because we have stem and bar methods for those representations.

@ryanpeach
Author

I suppose. But what you’re asking users to do is to use np.histogram, normalize themselves, and then do this: https://matplotlib.org/devdocs/gallery/lines_bars_and_markers/stem_plot.html#sphx-glr-gallery-lines-bars-and-markers-stem-plot-py

Whereas someone coming from an office environment is going to see plt.hist, think about how they might do it in Excel, and expect it to just work.

@timhoffm
Member

timhoffm commented Feb 20, 2018

You may argue that a stem is just a bar with zero width. However, technically, they are different objects in matplotlib (rectangle vs. line). Thus one has to use the appropriate function.

@timhoffm
Member

It might be feasible to add a pmf function or to extend stem, but I don't think it's reasonable to add this to hist.

@ryanpeach
Author

Then I would argue that there needs to be something that is the stem version of plt.hist...

Like, why are we wrapping np.histogram if not to make this easy?

@efiring
Member

efiring commented Feb 20, 2018

I think the stem is appropriate for the specific case that I mentioned much earlier, when there would be a stem for each unique discrete value. It is not appropriate for the case where there are bins holding values on a continuum, or more than a single integer, because it doesn't show the bin boundaries.

@ryanpeach
Author

Ok, under histtype we have loads of options for what kind of plot to use. Let users decide, include ‘stem’ in that list, make it automatic if mass is used and histtype is left default.

@ryanpeach
Author

ryanpeach commented Feb 20, 2018

True, I agree we need to show the bin boundaries for continuous plots. See my suggestion above. Maybe don’t make it automatic.

@ryanpeach
Author

ryanpeach commented Feb 20, 2018

Here’s another idea, gleaned from this Wikipedia article: https://en.m.wikipedia.org/wiki/Probability_density_function

The top figure is scaled on the xaxis to multiples of sigma. What if we used this math instead (or include the option for it)

1 == np.sum(n*np.diff(bin_edges)/np.mean(np.diff(bin_edges)))

Which basically cancels and gets rid of our problem of different bin sizes (maybe?)

This would give us scale invariance (which is really the problem with the PDFs as they are now).
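
One possible reading of this proposal, sketched under the assumption that the bar heights are the density values multiplied by the mean bin width (the thread does not spell out how the heights would be obtained, so the name rescaled and this interpretation are illustrative only):

```python
import numpy as np

x = np.asarray([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3], dtype="float")

density, bin_edges = np.histogram(x, bins=2, density=True)
widths = np.diff(bin_edges)

# Assumed interpretation: scale the density by the mean bin width.
rescaled = density * np.mean(widths)

# Left-hand side of the proposed condition, evaluated for the rescaled heights.
lhs = np.sum(rescaled * widths / np.mean(widths))

print(rescaled, lhs)
```

For equal-width bins the rescaled heights coincide with the per-bin fractions; for unequal bins they generally do not.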

@ryanpeach
Author

ryanpeach commented Feb 20, 2018

The issue about the pdf and the pmf is slightly pedantic anyway because there is no such thing as a continuous function in numpy/matplotlib. We are always dealing with discrete floating point numbers, and thus a distribution either with bins=len(np.unique(x)) (stem) or bins<len(np.unique(x)) (bar). Then there is the independent option of density with or without scale invariance. Am I right?

@efiring
Member

efiring commented Feb 20, 2018

No, this is not a matter of being pedantic. There is a big difference between looking at frequencies of discrete events (e.g., counts of Fords, Hondas, etc. going through a tollbooth in a day) and quantities that are binned such that there are several or many different values being grouped together in each of several bins. The fact that there is a finite number of floating point numbers that can be represented in a computer with a single or double-precision float is irrelevant.

@ryanpeach
Author

ryanpeach commented Feb 20, 2018

There is a big difference between looking at frequencies of discrete events (e.g., counts of Fords, Hondas, etc. going through a tollbooth in a day) and quantities that are binned such that there are several or many different values being grouped together in each of several bins.

But these aren't mutually exclusive. For instance, my example and @ImportanceOfBeingErnest 's example are binned, continuous, and thus demand a bar plot. However, they do not require normalization via the absolute area under the curve, rather by the normalized xaxis area under the curve. A stem chart won't work for them, because they aren't a selection of a few discrete things, yet the density function won't work for them either, because it cares about the absolute area under the curve. Is that a pdf or a pmf? You tell me.

@ryanpeach
Author

ryanpeach commented Feb 20, 2018

I think this inevitably comes down, on a technical level, to an argument over limiting choice within a function that was designed for convenience. All of this could be done by hand; plt.hist could be removed and everyone could just use bar plots or stem plots with np.histogram till the end of time. But it's obvious that someone thought including plt.hist was important for convenience, but didn't take into account all the popular ways of using it. Now we are stuck looking at Wikipedia definitions of mathematical terms that have at best a passing relation to the data visualization technique of a histogram, and the normalization constants commonly associated with one, and are arguing over how a user "should" do normalization, how they "should" visualize probabilities and frequencies, how they "should" bin things. Call those normalization options whatever you want, make pretty and "canonical" defaults, but not having them because they don't fit a mold (playing favorites) even though plenty of people use them every day is what is pedantic.

@ImportanceOfBeingErnest
Member

There is a good principle when it comes to plotting: Keep data aggregation and manipulation separate from data visualization.
Matplotlib provides functions to draw data as lines, points, shapes, bars etc. Once you have aggregated your data, you can decide on which kind of plot works best to visualize this data. This completely lies in the user's responsibility. If the user wants to draw a pmf of his data as a bar plot, he can do so.
So following this principle there should not even be a hist function in the first place - essentially it only saves some typing and is hence convenient to use. Once you start adding all possible use cases of this function as individual arguments you will lose much of the convenience. It may look as if this function is easier to use if some argument does exactly what you want, but you'll soon end up spending more time understanding all the arguments and their implications on the drawn result than you would have spent on aggregating the data in the way you need it and simply calling the line/bar/stem/scatter function.
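
A sketch of that separation: the aggregation is done with np.histogram, and a plain bar plot draws the result. The data values, the bin range, and the helper name plot_percent_bars are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical data, for illustration only.
data = np.array([12.0, 15.0, 21.0, 22.0, 25.0, 31.0, 33.0, 38.0, 39.0, 47.0])

# Step 1: aggregate. Count per bin, then convert to percent of all samples.
counts, edges = np.histogram(data, bins=4, range=(10.0, 50.0))
percent = 100.0 * counts / counts.sum()


# Step 2: visualize. Any bar-like artist works once the numbers exist.
def plot_percent_bars(percent, edges):
    """Hypothetical helper: draw the aggregated values as adjacent bars."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(edges[:-1], percent, width=np.diff(edges), align="edge", edgecolor="black")
    ax.set_ylabel("share of samples (%)")
    return fig, ax


print(percent)
```

The normalization lives entirely in step 1, so switching to counts, fractions, or a density only changes one line and leaves the plotting call untouched.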

In view of this, I am currently not sure if there is actually anything matplotlib should do, other than the two points mentioned above: (1) improve documentation (2) decide upon adding an additional frequency normalization argument.
@ryanpeach Despite all the comments added from your side, we are still at exactly the point we were yesterday. Be assured that people did understand that there might be a need to do (1) and (2) and they will decide on that at some point in the future.

@ryanpeach
Author

Ok, just let me know. I just hope you don't choose to ignore (2) while maintaining the density kwarg, that's just biased. I'll let the thread linger till you all make a decision but still offer my services in making the pull request.

@afvincent
Contributor

FWIW, Numpy has a similar issue thread: numpy/numpy#9921 (but it got significantly less traction, to say the least).

@anntzer
Contributor

anntzer commented Feb 20, 2018

I strongly believe mass should not be added (unless numpy adds it, which I would deem a mistake but would begrudgingly accept following).

@jklymak
Member

jklymak commented Feb 20, 2018

The two definitions are the same if you have equally spaced bins (which is the most usual situation). I’d agree that if numpy accepts mass we can revisit this and follow their lead. @ryanpeach if you want to continue discussion, you should chime in on the numpy issue linked above, or of course another dev is free to re-open this if they disagree...

@jklymak jklymak closed this as completed Feb 20, 2018
@afvincent
Contributor

afvincent commented Feb 21, 2018

A more or less recent decision was to make plt.hist a thinner (plotting) wrapper around np.histogram than it used to be (because the first one predated the latter IIRC), was it not? So I would suggest that if at some point we want to provide a way to plot PMF-like objects through plt.hist, then it may be a reasonable idea to work with Numpy devs to see if it can make it in np.histogram.

FWIW, after a bit of googling, it seems like some (and not most of those that I checked¹) plotting software packages offer the feature that is discussed here through a keyword argument like “norm”/“density”/etc. that can take string inputs among “pdf”, “cdf”, “counts”, “probability” (usually what is discussed here), “sf”, etc.

https://www.mathworks.com/help/matlab/ref/histogram.html
http://reference.wolfram.com/language/ref/Histogram.html
https://www.rdocumentation.org/packages/graphics/versions/3.4.3/topics/hist
https://www.originlab.com/doc/Origin-Help/Create-Histogram

Edit: burnt ^^.

@ryanpeach
Author

Just posted on the numpy thread.

I don't think it's numpy's job to do this because normalization is easy to do mathematically post-hoc. It's just the graphing that's hard. It's matplotlib's wrapper that makes it difficult to do intermediate steps on the binning like normalization. But I digress.

@nunocalaim

nunocalaim commented Aug 29, 2018

So I came here because I have a histogram with log=True and density=True and it is giving me weird results. Note that I have logarithmically spaced bins.
[image: screenshot 2018-08-29 at 18:59:22]

Note how on some histograms the bins for which supposedly there are no counts still have some height to them (they appear above 0). Ideally this is not what I was looking for. I feel this discussion is related to this problem, and I am unsure if this is a bug or intended.

@ImportanceOfBeingErnest
Member

@nunocalaim Thanks for reaching out. However, I do not think your issue is in any way related. Would you mind opening a new issue and including a self-contained example (i.e. code that one can copy, paste and run) to allow people to reproduce your problem? This is the only way one can find out if the outcome is expected or whether there is a problem in the matplotlib code base.

@jklymak
Member

jklymak commented Nov 24, 2020

Note that we now have stairs, which allows folks to normalize whatever way is most appropriate for their field. #18275
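
A minimal sketch of that workflow with the data from the original report. It assumes a Matplotlib version that provides Axes.stairs (3.4 or newer, to the best of my knowledge); the helper name plot_with_stairs is hypothetical:

```python
import numpy as np

x = np.asarray([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3], dtype="float")

# Bin once, then normalize however the field at hand prefers.
counts, edges = np.histogram(x, bins=2)
fractions = counts / counts.sum()        # bars sum to 1
density = fractions / np.diff(edges)     # bar areas sum to 1


def plot_with_stairs(values, edges):
    """Hypothetical helper: draw pre-computed bin values as a step outline."""
    import matplotlib.pyplot as plt   # assumes Axes.stairs is available

    fig, ax = plt.subplots()
    ax.stairs(values, edges, fill=True)
    return fig, ax


print(fractions, density)
```

Either plot_with_stairs(fractions, edges) or plot_with_stairs(density, edges) can then be drawn, which keeps the choice of normalization in user code.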

I'm pretty concerned with Matplotlib mediating multiple normalization conventions among fields. But if someone can come up with a clean way to accommodate everyone, or write their own wrapper, I imagine we could consider that.


10 participants