-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
hist(): align!='mid' and unequally spaced bins don't play well together. #12524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2315e1b
to
6de5e10
Compare
@@ -6580,6 +6580,12 @@ def hist(self, x, bins=None, range=None, density=None, weights=None, | |||
mlast[:] = m | |||
tops.append(m) | |||
|
|||
# align != 'mid' only makes sense for equal-sized bins. | |||
if align != 'mid' and not cbook._is_equally_spaced(bins): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should only apply to the step-like functions, so if histtype in ['step', 'stepfilled']
...
For the bar-like functions, it may not make too much sense, but at least it's not wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think #12494 shows very clearly that whoever implemented align
did not take non-equally-spaced bins into account, and while the result is "not wrong", it doesn't make much sense either.
Actually, for unequal spacing and bar, it does what it says in the docs. But that result is misleading at best:
OTOH I don't see a use-case where this could be desired. Therefore +1 on not allowing 'align' with variable spacing. |
lib/matplotlib/axes/_axes.py
Outdated
# align != 'mid' only makes sense for equal-sized bins. | ||
if align != 'mid' and not cbook._is_equally_spaced(bins): | ||
raise ValueError( | ||
"When 'bins' are not equally spaced, the only valid alignment " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should mention align
in here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
6de5e10
to
0e27266
Compare
the |
In a usual plot the bar should cover the range There's one exception with singled-value bins. For an int list like The whole concept of align is ill-defined for histograms. Instead it would have been better to define
|
Well technically 'edge' and 'center' are not valid values for 'align' yet, so we could introduce them with these semantics and deprecate 'left'/'mid'/'right' at the same time, but I don't care enough to fight that API change fight... |
Well, if there was some change, we would have to deprecate the |
'align' already exists for 'bar' essentially with that same meaning, there's also benefit in not changing that. |
Would it maybe make sense to plot a cumulative distribution with the bar on the right side of the bin edges (especially with rwidth != 1)? What happens if you want to plot such distribution on a log scale? Would using unequal bin sizes make sense in such case? |
You would want a CDF with right-aligned bars, however, you need quite strange input data to get a CDF from histogram binning. I think CDF should be plotted with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Promoting this as 'better than before'. If we want to do a fundamental change to the naming/behavior can be discussed independently.
It's not totally clear to me why you'd want to use align='right' with cdfs, but even if you do (and more generally, if some kwargs only make sense in combination with some other kwargs, and give "questionable" results otherwise), I'd consider this mostly suggests that the functionality should be split into a separate function. |
As an example:
Or, on a log scale with unequally sized bins,
|
Still, it’s quite unusual to have data in a way that a histogram of them would be a CDF. Histograms are naturally linked with PDFs (count the number of values in that interval). If you want hist to be a CDF, your hist input data would need to have n+m values in that interval, where n is the number of values of actual data you want to describe, and m is the number of values that are smaller than the lower interval border. May sound a bit strange, but that’s the point. Think about it. Histograms and CDFs are different pairs of shoes. |
Agree w/ @timhoffm rant. Don't agree that CDF's shoudl be right-aligned either though, unless they are caclulated in some incorrect way. The value of the bar should be the sum of all occurences to the left of the right hand side of the bar. |
Let's please stop the discussion on CDFs here. It's leading off topic.
Happy to discuss possible API changes to |
.. anyhoo, this needs a cbook and a hist test. And an API change note, I think, since there may have been folks who were doing the other alignements on uneven bins.... |
As usual, feel free to push the test, I honestly don't think it adds much. |
I guess we should discuss the test policy - at some point I thought we were talking about aiming for 100% coverage, but maybe I misunderstood. I agree these tests would be trivial, but OTOH, these tests would be trivial... |
Since this PR is an API change, we can discuss it here, not elsewhere. Sorry for using the notion CDF. Of course a histogram never shows a CDF, but it can approximate the value of a CDF at certain positions (which is shown in the plots above). I don't think there is an error in the way The above applies to the bar-like histtypes. Concerning the step-like types, there actually is a problem, because the result is wrong. For this case, throwing an error is fine, as it prevents non-sensical output and thereby helps the user. |
@ImportanceOfBeingErnest sorry, I did not read your example carefully (wasn't aware of the |
0e27266
to
66caed5
Compare
Added test. Need to think more about the cdf discussion above... |
Also xref'ing #11506 re: align+rwidth+log. |
Marking WIP, feel free to clear if you are done thinking about CDFs, and go ahead and ping me and I'll review... |
66caed5
to
6242747
Compare
Did the rebase, but then read through the discussion and see that there is still some concern over CDFs, pushing to 3.3. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I revoke my approval of the current version. I'm not sure that it does not affact valid use cases.
I'm not sure that this does not affect valid use cases.
Since this Pull Request has not been updated in 60 days, it has been marked "inactive." This does not mean that it will be closed, though it may be moved to a "Draft" state. This helps maintainers prioritize their reviewing efforts. You can pick the PR back up anytime - please ping us if you need a review or guidance to move the PR forward! If you do not plan on continuing the work, please let us know so that we can either find someone to take the PR over, or close it. |
Let's just track this at #12494. |
PR Summary
See example in #12494.
PR Checklist