plt.hist density argument does not function as described. #10398

ryanpeach · 2018-02-08T15:44:45Z

Bug report

Bug summary

plt.hist density argument does not function as described.

Code for reproduction

import matplotlib.pyplot as plt
import numpy as np
x = np.asarray([0,0,0,0,1,1,2,2,2,2,3,3,3,3,3,3,3], dtype="float")
n, bins, _ = plt.hist(x, density=1, bins=2)
print(n)
print(np.sum(n))

Actual outcome

[0.23529412 0.43137255]
0.6666666666666666

Expected outcome

import matplotlib.pyplot as plt
import numpy as np
x = np.asarray([0,0,0,0,1,1,2,2,2,2,3,3,3,3,3,3,3], dtype="float")
n, bins, _ = plt.hist(x, density=1, bins=2)
n /= np.sum(n)  # As advertised
print(n)
print(np.sum(n))

[0.35294118 0.64705882]
1.0

Matplotlib version

Operating system: OSX High Sierra
Matplotlib version: 2.1.1
Matplotlib backend: MacOSX
Python version: 3.6.4
Other libraries: numpy 1.14.0
Installed via: pip3


ImportanceOfBeingErnest · 2018-02-08T17:22:37Z

The histogram works as expected. The density argument is explained in the documentation

density : boolean, optional
If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., the area (or integral) under the histogram will sum to 1. [...]

Hence if you calculate the integral,

print (np.sum(n*np.diff(bins)))

the result is indeed 1.0, as expected.
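The check above can be reproduced without drawing anything. This is a minimal sketch that assumes np.histogram returns the same values as plt.hist for this input (later in the thread it is noted that hist forwards the computation to numpy for non-stacked histograms); the variable names are illustrative.

```python
import numpy as np

# Data from the original report.
x = np.asarray([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3], dtype=float)

# np.histogram is used instead of plt.hist so that no figure is needed.
n, bins = np.histogram(x, bins=2, density=True)

widths = np.diff(bins)        # width of each bin: 1.5 and 1.5
heights_sum = np.sum(n)       # sum of the bar heights, about 0.667
area = np.sum(n * widths)     # sum of the bar areas (height * width)

print(n, heights_sum, area)
```

The heights sum to 2/3 because each bar is 1.5 wide; multiplying by the widths recovers an area of 1.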

You might want to search stackoverflow.com on questions where people got confused by the density or normed keyword argument.

tacaswell · 2018-02-08T20:54:56Z

@ryanpeach Which documentation did you find misleading?

ryanpeach · 2018-02-15T18:11:22Z

I guess I don’t understand why np.diff(bins) is ever relevant for the summation. “The counts normalized to form a pdf, i.e., the area under the histogram will sum to one” literally means “[the sum of] the first element of the tuple will sum to 1”, as the number of bins in a histogram is more like a resolution for the integral than it is an x value.

I don’t understand how to generate the desired functionality. Maybe there should be an option for this if it’s such a common misunderstanding.

ryanpeach · 2018-02-15T18:19:35Z

I guess what I’m saying is, a pdf doesn’t include the xvalue in integrating to find the normalization constant. It’s the y value that must sum to 1.

tacaswell · 2018-02-15T18:41:19Z

The area of each bar is (height * width), so “the counts normalized to form a pdf, ie, the area (or integral) under the histogram will sum to one” translates to np.sum(n*np.diff(bins)) == 1 (not np.sum(n) == 1).

We changed the kwarg from 'normed', which could reasonably be either one, to 'density' to avoid this ambiguity.

If you want your bars to sum to one (independent of the bin widths) then pass in weights=np.ones(len(data)) / len(data) and density=False.
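A minimal sketch of the weights recipe, applied to the data from the report. It uses np.histogram rather than plt.hist so it runs without a display; the assumption is that plt.hist accepts the same weights and density arguments, as the comment above suggests.

```python
import numpy as np

data = np.asarray([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3], dtype=float)

# Every sample contributes 1/N, so each bin ends up holding the
# fraction of samples that fall into it.
weights = np.ones(len(data)) / len(data)

fractions, edges = np.histogram(data, bins=2, weights=weights, density=False)
total = fractions.sum()

print(fractions, total)
```

The two values are 6/17 and 11/17, which matches the "Expected outcome" in the report.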

tacaswell · 2018-02-15T18:42:50Z

@ryanpeach How would you re-write that docstring to be clearer to you?

ImportanceOfBeingErnest · 2018-02-15T23:31:32Z

The numpy.histogram documentation is a bit more verbose on this part:

density : bool, optional

If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.

It also directly provides an example case which makes this case clear:

Maybe the matplotlib documentation can link to the numpy docs or take this over?

The comment above

If you want your bars to sum to one (independent of the bin widths) then pass in weights=np.ones(len(data)) / len(data) and density=False.

is also a good one which could be added to the docs.

Alternatively, in order not to blow up the docstrings, what about a dedicated histogram example? It seems there is currently only the pyplot-text example around showing how to use the hist function?

@ryanpeach "a pdf doesn’t include the xvalue in integrating to find the normalization constant" is not correct; any integral takes the independent variable into account, i.e. in ∫ f(x) dx you have dx, which is exactly the bin width.
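The role of dx can be illustrated with a small sketch. The five samples and the two binnings are chosen for illustration only; the point is that the heights sum to 1 only when the bins have unit width, while the sum of height times width is 1 in both cases.

```python
import numpy as np

samples = np.arange(5, dtype=float)  # 0, 1, 2, 3, 4

# Ten bins of width 0.5 over [0, 5].
narrow_n, narrow_edges = np.histogram(samples, bins=10, range=(0, 5), density=True)
# Five bins of width 1.0 over the same range.
unit_n, unit_edges = np.histogram(samples, bins=5, range=(0, 5), density=True)

narrow_sum = narrow_n.sum()                              # heights only
narrow_area = np.sum(narrow_n * np.diff(narrow_edges))   # heights * dx
unit_sum = unit_n.sum()
unit_area = np.sum(unit_n * np.diff(unit_edges))

print(narrow_sum, narrow_area, unit_sum, unit_area)
```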

ryanpeach · 2018-02-17T23:10:33Z

Ok, so I just learned the distinction between a pdf and a pmf. Why not just add normed as a flag and make it the pmf? This is a very common feature, and shouldn’t require hacks to perform.

diff = np.diff(bin_edge) if not normed else 1.

ImportanceOfBeingErnest · 2018-02-17T23:48:16Z

I guess it will be hard to add another functionality to an argument which has been there for years, even though it is deprecated by now. So instead of normed one could surely think of another name.

However, one problem would be that the histogram can take unequal bin sizes; in this case normalizing to the sum of the counts is rather dangerous. Any such plot would essentially need a label "Attention, this plot has been normalized to the sum of the frequencies independent of the bin width" or so. Hence I'm not sure if adding this normalization will not lead to even more confusion.

At the end why not stick to the above suggestion of

weights=np.ones(len(data)) / len(data), density=False
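The concern about unequal bin sizes can be made concrete with a sketch. The data here are hypothetical: 1000 values spread evenly over (0, 10), binned with deliberately unequal edges.

```python
import numpy as np

# Hypothetical data: 1000 values spread evenly over (0, 10).
data = (np.arange(1000) + 0.5) / 100.0
edges = [0, 1, 2, 10]  # bin widths 1, 1 and 8

# Fraction of samples per bin, using the weights recipe.
weights = np.ones(len(data)) / len(data)
fractions, _ = np.histogram(data, bins=edges, weights=weights)

# Density per bin (fraction divided by bin width).
density, _ = np.histogram(data, bins=edges, density=True)

print(fractions)  # about [0.1, 0.1, 0.8]
print(density)    # about [0.1, 0.1, 0.1]
```

The density is flat, as expected for evenly spread data, while the fractions make the widest bin look eight times taller. Both are valid numbers; the plot just has to say which one it shows.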

efiring · 2018-02-18T01:44:06Z

For a PMF perhaps you should use a different calculation and plotting method entirely: instead of bins, find the counts of values for all unique (discrete) values in the array, and plot each count at each discrete value, either with a marker or with a vertical line up from the origin.
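A minimal sketch of this approach for the data in the report, counting each unique value directly instead of binning. The variable names are illustrative.

```python
import numpy as np

x = np.asarray([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3])

# One entry per distinct value, with the number of times it occurs.
values, counts = np.unique(x, return_counts=True)

# Relative frequency of each distinct value; these sum to 1.
pmf = counts / counts.sum()

for value, probability in zip(values, pmf):
    print(value, probability)
```

The (values, pmf) pairs can then be handed to a marker or stem plot, one mark per discrete value.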

ryanpeach · 2018-02-18T04:02:47Z

Eric, that’s a good idea, except for when you are dealing with semi-continuous things like integers, or continuous values that you still want divided into bins. Like, if you have 100 samples of a continuous distribution, your distribution will not look continuous without bins.

ryanpeach · 2018-02-18T04:06:00Z

However, one problem would be that the histogram can take unequal bin sizes

Make those two operations mutually exclusive (illegal to use unequal bin sizes during pmf)

what to call it

Call it “mass” for consistency.

ryanpeach · 2018-02-18T04:11:26Z

why not stick to the above suggestion

Because I can hardly think of a time when a pdf would be useful in the context of a histogram, whereas pmfs are very natural use cases for them, yet we have a flag for the pdf and not one for the pmf.

On top of that, weighting the values individually seems unintuitive, though I can see why it’s correct. It just seems like something the average user shouldn’t have to hack around. I’ll add it if that’s what it takes.

tacaswell · 2018-02-18T17:20:36Z

This conversation has gotten more contentious than necessary.

Make those two operations mutually exclusive (illegal to use unequal bin sizes during pmf)

This is pushing on a tension we have throughout Matplotlib between methods with simple signatures that do one thing (and try to do it well) and methods with very complex signatures (and internals) that do many different things. Having kwargs that are mutually exclusive or have complex conditional dependencies between them quickly becomes hard to document and maintain (we already have things like this so this is not a theoretical concern). The signature and implementation of hist are already pretty complex, I am extremely wary of adding yet more.

Because I can hardly think of a time when a pdf would be useful in the context of a histogram, whereas pmfs are very natural use cases for them, yet we have a flag for the pdf and not one for the pmf.

On top of that, weighting the values individually seems unintuitive, though I can see why it’s correct. It just seems like something the average user shouldn’t have to hack around.

Matplotlib has a wide enough user base that we have to be careful about generalization and appeals to 'intuitive'. From my background (physics), the density normalization seems 'natural' and normalizing by mass seems like a corner case where I am fine using the 'weights' method (but definitely get that my experience is not universal).

I’ll add it if that’s what it takes.

As I said above, I am wary of adding complexity to hist. If you do want to add a mass kwarg, how will it interact with density, normed (which is still there, we did not set a deprecation version), and 'stacked' histograms?

It may also be worth taking this question to numpy (a year or so ago, we had a request for optimal bin estimation in hist which got pushed to numpy and now we have bins='auto' on np.histogram). Currently (I think) in the case of non-stacked histograms we just forward all of the computation down to numpy.

@ImportanceOfBeingErnest

However, one problem would be that the histogram can take unequal bin sizes; in this case normalizing to the sum of the counts is rather dangerous.

How is this different than passing in the weights?

efiring · 2018-02-19T00:58:47Z

I don't see any problem with unequal bin sizes; the point of the PMF is that the value for each bin is the probability (or more correctly, the relative frequency) of values landing in that bin; there is no reason the bins have to be the same size.

The big difference between the PMF and the PDF is the way they change with the bin boundaries, given a continuous rather than discrete RV. So long as the bins cover the full range of values in both cases, the basic shape and level of the PDF is independent of the number of bins; it just gets noisier as the number of bins increases. For a PMF, in contrast, the general level is proportional to the bin width. If the bins are not the same size, there is nothing wrong with this--but it has to be kept in mind when interpreting the plot. In the case of a PMF bar plot, it means the visual mass of the bar is not a good indicator of frequency.
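The dependence on the number of bins can be sketched numerically. The samples below are hypothetical (1000 values spread evenly over (0, 1)) and `levels` is a helper introduced only for this illustration.

```python
import numpy as np

# Hypothetical continuous-looking data: 1000 values spread evenly over (0, 1).
samples = (np.arange(1000) + 0.5) / 1000.0


def levels(num_bins):
    """Return (fraction per bin, density per bin) for equal-width bins on [0, 1]."""
    counts, edges = np.histogram(samples, bins=num_bins, range=(0.0, 1.0))
    fraction = counts / counts.sum()     # PMF-style: relative frequency
    density = fraction / np.diff(edges)  # PDF-style: frequency per unit x
    return fraction, density


frac10, dens10 = levels(10)
frac20, dens20 = levels(20)

print(frac10[0], dens10[0])  # about 0.1 and 1.0
print(frac20[0], dens20[0])  # about 0.05 and 1.0
```

Doubling the number of bins halves the level of the fractions but leaves the density level at 1.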

Perhaps in an ideal world numpy would have provided a "normalization" kwarg with values of "counts", "pdf", and "pmf", (or "counts", "density", and "fraction") and we would just propagate that into our hist.

jklymak · 2018-02-19T01:17:41Z

I wrote a longer response, but I'll just say that a quick look on Google indicates that PMFs seem to be usually plotted as stem plots, as @efiring suggested above. The definition of a histogram (from wikipedia) is that it represents a finite approximation of a PDF, and that's how I've always understood it. I'd need to see some evidence that there is a large (published) community out there who want the PMF to be plotted as a histogram before I'd support adding a new kwarg.

ImportanceOfBeingErnest · 2018-02-19T02:14:33Z

So to summarize:

(1) Documentation. A user found the normalization (density) argument confusing. Judging from the frequent questions about the topic on stackoverflow, lots of users are similarly confused. I guess reasons range from them having no math background at all, over the word "histogram" being frequently used for just any barplot, to not remembering the definition of a pdf.
The first thing would hence be to update the documentation. There are some proposals above and one may agree on any of them.

(2) Extended functionality of hist. I guess we cannot judge on every possible use case of the hist function. Independent of whether it makes sense to use it in the case brought up here, it is probably clear that people want to use this to produce plots of the relative frequency of binned values. Just to give an example:

Such a plot can be useful to directly grasp the binned probability. E.g. we can directly read that 37% of all products cost between 100 and 140 dollars. This plot can easily be produced with weights=np.ones_like(data)*100./len(data). The question is hence simply whether to let people use this clumsy argument, or whether to add some new argument making it easier for people to create such plots; e.g. plotly does have such an argument, histnorm ("" | "percent" | "probability" | "density" | "probability density").
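A sketch of the percent recipe with made-up numbers. The prices and bin edges below are hypothetical and are not the data behind the plot described above.

```python
import numpy as np

# Hypothetical product prices in dollars.
prices = np.asarray([95, 105, 110, 120, 135, 150, 155, 170, 190, 210], dtype=float)

# Each sample contributes 100/N, so bin heights are percentages.
weights = np.ones_like(prices) * 100.0 / len(prices)

edges = [60, 100, 140, 180, 220]  # four bins, each 40 dollars wide
percent, _ = np.histogram(prices, bins=edges, weights=weights)

print(percent)  # about [10, 40, 30, 20]
```

Here 40% of the hypothetical products cost between 100 and 140 dollars, which can be read straight off the bar height.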

ryanpeach · 2018-02-20T20:49:59Z

I think it would be useful to provide visual examples such as that for each of the modes we are discussing. I’m still pretty confused as to any true use case of the density argument for the histogram function as-is. And a stem-leaf plot? Please provide an example of using that to get a pmf. The use case @ImportanceOfBeingErnest has documented is almost certainly the normal user’s intention for using a histogram, and intuitively a histogram would be the first thing such a user would try to get such information visually. And even if we disagree on that subjective polling, we can’t disagree that it’s both valid and popular as a feature request, and that the provided solution at the very least requires a search to discover. Not only that, but it’s easy to implement and consistent with existing implemented kwargs.

jklymak · 2018-02-20T20:55:46Z

https://en.m.wikipedia.org/wiki/Probability_mass_function

ryanpeach · 2018-02-20T21:02:23Z

That’s completely unhelpful.

ryanpeach · 2018-02-20T21:03:23Z

Sure it shows a stem plot for a pmf, but nothing about a matplotlib implementation.

ryanpeach · 2018-02-20T21:04:39Z

And why that shouldn’t just be a matplotlib histogram with the bar widths reduced to zero and a mass kwarg set to true is beyond me.

jklymak · 2018-02-20T21:08:50Z

Because we have stem and bar methods for those representations.

ryanpeach · 2018-02-20T21:11:10Z

I suppose. But what you’re asking users to do is to use np.histogram, normalize themselves, and then do this https://matplotlib.org/devdocs/gallery/lines_bars_and_markers/stem_plot.html#sphx-glr-gallery-lines-bars-and-markers-stem-plot-py
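For reference, the manual route described here might look like the following sketch: bin with np.histogram, normalize by hand, then draw with stem or bar. The choice of four bins, the bin centers as stem positions, and the Agg backend are assumptions made so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

x = np.asarray([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3], dtype=float)

# Step 1: bin the data.
counts, edges = np.histogram(x, bins=4)
# Step 2: normalize so the heights are fractions of all samples.
fractions = counts / counts.sum()
# Step 3: draw, either as stems at the bin centers or as bars spanning the bins.
centers = 0.5 * (edges[:-1] + edges[1:])

fig, (ax_stem, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))
ax_stem.stem(centers, fractions)
ax_bar.bar(edges[:-1], fractions, width=np.diff(edges), align="edge", edgecolor="black")
ax_bar.set_ylabel("fraction of samples")
```

The bar version keeps the bin boundaries visible, while the stem version does not.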

Whereas someone coming from an office environment is going to see plt.hist, think about how they might do it in excel, and expect it to just work.

timhoffm · 2018-02-20T21:11:46Z

You may argue that a stem is just a bar with zero width. However, technically, they are different objects in matplotlib (rectangle vs. line). Thus one has to use the appropriate function.

timhoffm · 2018-02-20T21:14:34Z

It might be feasible to add a pmf function or to extend stem, but I don't think it's reasonable to add this to hist.

ryanpeach · 2018-02-20T21:14:45Z

Then I would argue that there needs to be something that is the stem version of plt.hist...

Like, why are we wrapping np.histogram if not to make this easy?

efiring · 2018-02-20T21:18:27Z

I think the stem is appropriate for the specific case that I mentioned much earlier, when there would be a stem for each unique discrete value. It is not appropriate for the case where there are bins holding values on a continuum, or more than a single integer, because it doesn't show the bin boundaries.

ryanpeach · 2018-02-20T21:20:20Z

Ok, under histtype we have loads of options for what kind of plot to use. Let users decide, include ‘stem’ in that list, make it automatic if mass is used and histtype is left default.

ryanpeach · 2018-02-20T21:21:12Z

True I agree we need to show the bins boundaries for continuous plots. See my suggestion above. Maybe don’t make it automatic.

ryanpeach · 2018-02-20T21:30:59Z

Here’s another idea, gleaned from this Wikipedia article https://en.m.wikipedia.org/wiki/Probability_density_function

The top figure is scaled on the xaxis to multiples of sigma. What if we used this math instead (or include the option for it)

1==np.sum(n*np.diff(bin_edges)/np.mean(np.diff(bin_edges)))

Which basically cancels and gets rid of our problem of different bin sizes (maybe?)

This would give us scale invariance (which is really the problem with the PDFs as they are now).
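A sketch of what this proposal computes. `mean_width_normalized` is a hypothetical helper written for this illustration, not an existing numpy or Matplotlib function; it returns heights n that satisfy the formula above.

```python
import numpy as np


def mean_width_normalized(data, bins):
    """Heights n with np.sum(n * widths / widths.mean()) == 1 (hypothetical helper)."""
    density, edges = np.histogram(data, bins=bins, density=True)
    widths = np.diff(edges)
    # Rescaling the density by the mean bin width satisfies the formula.
    return density * widths.mean(), edges


x = np.asarray([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3], dtype=float)

n_equal, edges_equal = mean_width_normalized(x, 2)
n_unequal, edges_unequal = mean_width_normalized(x, [0, 1.5, 2.5, 3])

print(n_equal, n_equal.sum())      # heights sum to 1 for equal-width bins
print(n_unequal, n_unequal.sum())  # heights sum to 22/17 for these unequal bins
```

With equal-width bins the result is exactly the fraction of samples per bin. With unequal bins the formula still holds, but the heights no longer sum to 1, so the different-bin-size problem does not cancel out.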

ryanpeach · 2018-02-20T21:45:21Z

The issue about the pdf and the pmf is slightly pedantic anyway because there is no such thing as a continuous function in numpy/matplotlib. We are always dealing with discrete floating point numbers, and thus a distribution either with bins=len(np.unique(x)) (stem) or bins<len(np.unique(x)) (bar). Then there is the independent option of density with or without scale invariance. Am I right?

efiring · 2018-02-20T22:33:45Z

No, this is not a matter of being pedantic. There is a big difference between looking at frequencies of discrete events (e.g., counts of Fords, Hondas, etc. going through a tollbooth in a day) and quantities that are binned such that there are several or many different values being grouped together in each of several bins. The fact that there is a finite number of floating point numbers that can be represented in a computer with a single or double-precision float is irrelevant.

ryanpeach · 2018-02-20T23:16:31Z

There is a big difference between looking at frequencies of discrete events (e.g., counts of Fords, Hondas, etc. going through a tollbooth in a day) and quantities that are binned such that there are several or many different values being grouped together in each of several bins.

But these aren't mutually exclusive. For instance, my example and @ImportanceOfBeingErnest's example are binned, continuous, and thus demand a bar plot. However, they do not require normalization via the absolute area under the curve, rather by the normalized xaxis area under the curve. A stem chart won't work for them, because they aren't a selection of a few discrete things, yet the density function won't work for them either, because it cares about the absolute area under the curve. Is that a pdf or a pmf? You tell me.

ryanpeach · 2018-02-20T23:29:57Z

I think this inevitably comes down, on a technical level, to an argument over limiting choice within a function that was designed for convenience. All of this could be done by hand, plt.hist could be removed and let everyone just use bar plots or stem plots with np.histogram till the end of time. But it's obvious that someone thought including plt.hist was important for convenience, but didn't take into account all the popular ways of using it. Now we are stuck looking at wikipedia definitions of mathematical terms that have at best a passing relation to the data visualization technique of a histogram, and the normalization constants commonly associated with one, and are arguing over how a user "should" do normalization, how they "should" visualize probabilities and frequencies, how they "should" bin things. Call those normalization options whatever you want, make pretty and "canonical" defaults, but not having them because they don't fit a mold (playing favorites) even though plenty of people use them every day is what is pedantic.

ImportanceOfBeingErnest · 2018-02-20T23:36:00Z

There is a good principle when it comes to plotting: Keep data aggregation and manipulation separate from data visualization.
Matplotlib provides functions to draw data as lines, points, shapes, bars etc. Once you have aggregated your data, you can decide on which kind of plot works best to visualize this data. This completely lies in the user's responsibility. If the user wants to draw a pmf of his data as a bar plot, he can do so.
So following this principle there should not even be a hist function in the first place - essentially it only saves some typing and is hence convenient to use. Once you start adding all possible use cases of this function as individual arguments you will lose much of the convenience. It may look as if this function is easier to use if some argument does exactly what you want, but you'll soon end up spending more time understanding all the arguments and their implications on the drawn result than you would have spent on aggregating the data in the way you need it and simply calling the line/bar/stem/scatter function.

In view of this, I am currently not sure if there is actually anything matplotlib should do, other than the two points mentioned above: (1) improve documentation (2) decide upon adding an additional frequency normalization argument.
@ryanpeach Despite all the comments added from your side, we are still at exactly the point we were yesterday. Be assured that people did understand that there might be a need to do (1) and (2) and they will decide on that at some point in the future.

ryanpeach · 2018-02-20T23:39:06Z

Ok, just let me know. I just hope you don't choose to ignore (2) while maintaining the density kwarg, that's just biased. I'll let the thread linger till you all make a decision but still offer my services in making the pull request.

afvincent · 2018-02-20T23:45:17Z

FWIW, Numpy has a similar issue thread: numpy/numpy#9921 (but it got significantly less traction, to say the least).

anntzer · 2018-02-20T23:52:42Z

I strongly believe mass should not be added (unless numpy adds it, which I would deem a mistake but would begrudgingly accept following).

jklymak · 2018-02-20T23:58:59Z

The two definitions are the same if you have equally spaced bins (which is the most usual situation). I’d agree that if numpy accepts mass we can revisit this and follow their lead. @ryanpeach if you want to continue discussion, you should chime in on the numpy issue linked above, or of course another dev is free to re-open this if they disagree...

afvincent · 2018-02-21T00:05:38Z

A more or less recent decision was to make plt.hist a thinner (plotting) wrapper around np.histogram than it used to be (because the first one predated the latter IIRC), was it not? So I would suggest that if at some point we want to provide a way to plot PMF-like objects through plt.hist, then it may be a reasonable idea to work with Numpy devs to see if it can make it in np.histogram.

FWIW, after a bit of googling, it seems like some (though not most of those that I checked¹) plotting software packages offer the feature that is discussed here through a keyword argument like “norm”/“density”/etc. that can take string inputs among “pdf”, “cdf”, “counts”, “probability” (usually what is discussed here), “sf”, etc.

https://www.mathworks.com/help/matlab/ref/histogram.html
http://reference.wolfram.com/language/ref/Histogram.html
https://www.rdocumentation.org/packages/graphics/versions/3.4.3/topics/hist
https://www.originlab.com/doc/Origin-Help/Create-Histogram

Edit: burnt ^^.

ryanpeach · 2018-02-21T00:30:17Z

Just posted on the numpy thread.

I don't think it's numpy's job to do this because normalization is easy to do mathematically post-hoc. It's just the graphing that's hard. It's Matplotlib's wrapper that makes it difficult to do intermediate steps on the binning like normalization. But I digress.

nunocalaim · 2018-08-29T18:01:22Z

So I came here because I have a histogram with log=True and density=True and it is giving me weird results. Note that I have logarithmic spaced bins

Note how on some histograms the bins for which there are supposedly no counts still have some height to them (they appear above 0). Ideally this is not what I was looking for. I feel this discussion is related to this problem. And I am unsure if this is a bug or intended.

ImportanceOfBeingErnest · 2018-08-29T20:15:22Z

@nunocalaim Thanks for reaching out. However I do not think your issue is in any way related. Would you mind opening a new issue and include a self-contained example (i.e. a code one can copy, paste and run) to allow people to reproduce your problem? This is the only way one can find out if the outcome is expected or whether there is a problem in the matplotlib code base.

jklymak · 2020-11-24T16:27:12Z

Note that we now have stairs which allows folks to normalize whatever way is most appropriate for their field. #18275
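A sketch of that workflow, assuming the method meant here is Axes.stairs (available from Matplotlib 3.4) and that it takes the bin values followed by the bin edges. The normalization is done by hand and the result is passed straight to the drawing call.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

x = np.asarray([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3], dtype=float)

# Bin once, then normalize in whichever convention is wanted.
counts, edges = np.histogram(x, bins=2)
fractions = counts / counts.sum()    # heights sum to 1
density = fractions / np.diff(edges) # areas sum to 1

fig, ax = plt.subplots()
ax.stairs(fractions, edges, label="fraction per bin")
ax.stairs(density, edges, label="density")
ax.legend()
```

Because the normalization happens before plotting, no extra keyword argument on hist is needed for either convention.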

I'm pretty concerned with Matplotlib mediating multiple normalization conventions among fields. But if someone can come up with a clean way to accommodate everyone, or write their own wrapper, I imagine we could consider that.

dstansby added the "status: needs clarification" label Feb 12, 2018

jklymak closed this as completed Feb 20, 2018

eric-wieser mentioned this issue Jun 13, 2018

Add mass keyword to np.hist() numpy/numpy#9921

Open
