Skip to content

hist(): align!='mid' and unequally spaced bins don't play well together. #12524

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

anntzer
Copy link
Contributor

@anntzer anntzer commented Oct 14, 2018

PR Summary

See example in #12494.

PR Checklist

  • Has Pytest style unit tests
  • Code is Flake 8 compliant
  • New features are documented, with examples if plot related
  • Documentation is sphinx and numpydoc compliant
  • Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
  • Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

@anntzer anntzer added this to the v3.1 milestone Oct 14, 2018
@anntzer anntzer force-pushed the unequal-bins-align branch from 2315e1b to 6de5e10 Compare October 14, 2018 12:21
@@ -6580,6 +6580,12 @@ def hist(self, x, bins=None, range=None, density=None, weights=None,
mlast[:] = m
tops.append(m)

# align != 'mid' only makes sense for equal-sized bins.
if align != 'mid' and not cbook._is_equally_spaced(bins):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should only apply to the step-like functions, so if histtype in ['step', 'stepfilled'] ...
For the bar-like functions, it may not make too much sense, but at least it's not wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think #12494 shows very clearly that whoever implemented align did not take non-equally-spaced bins into account, and while the result is "not wrong", it doesn't make much sense either.

@timhoffm
Copy link
Member

timhoffm commented Oct 14, 2018

Actually, for unequal spacing and bar, it does what it says in the docs. But that result is misleading at best:

import matplotlib.pyplot as plt

t = np.hstack([
    np.linspace(0, 0.99, 2),
    np.linspace(1, 1.99, 5),
    np.linspace(2, 2.99, 3),
    np.linspace(3, 3.99, 2),
])

fig, axs = plt.subplots(3, 1, sharex=True)
axs[0].hist(t, [0, 1, 3, 4], ec='darkblue', lw=3, align='left')
axs[1].hist(t, [0, 1, 3, 4], ec='darkblue', lw=3, align='mid')
axs[2].hist(t, [0, 1, 3, 4], ec='darkblue', lw=3, align='right')

grafik

  1. The actual width of a bar can be hidden (the first bar in the 'left' example)
  2. There can be places where there is data, but no bar value: 2.99 is in the data but no bar is covering that position in 'right'.
  3. Multiple bars can overlap (3.5-4 in 'right'). What's that supposed to mean in a histogram?

OTOH I don't see a use-case where this could be desired.

Therefore +1 on not allowing 'align' with variable spacing.

# align != 'mid' only makes sense for equal-sized bins.
if align != 'mid' and not cbook._is_equally_spaced(bins):
raise ValueError(
"When 'bins' are not equally spaced, the only valid alignment "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should mention align in here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@anntzer anntzer force-pushed the unequal-bins-align branch from 6de5e10 to 0e27266 Compare October 14, 2018 18:56
@tacaswell
Copy link
Member

the align kwarg on hist is deeply confusing as it has different behavior than the align kwarg on bar...

@timhoffm
Copy link
Member

<rant>
Additional remark. I don't see a use-case for align='right' even with equally spaced bins.

In a usual plot the bar should cover the range x_i, x_i+1 because that's the range it actually takes data from.

There's one exception with singled-value bins. For an int list like [0, 0, 1, 1, 1, 2, 2], you would want the "1" bin to be centered on the one. That's what align='left' does.

The whole concept of align is ill-defined for histograms. Instead it would have been better to define

    bintype : {'edge', 'center'}
        Interpret *bins* either as the edges or as the center positions of the bars.

</rant>

@anntzer
Copy link
Contributor Author

anntzer commented Oct 14, 2018

Well technically 'edge' and 'center' are not valid values for 'align' yet, so we could introduce them with these semantics and deprecate 'left'/'mid'/'right' at the same time, but I don't care enough to fight that API change fight...

@timhoffm
Copy link
Member

timhoffm commented Oct 14, 2018

Well, if there was some change, we would have to deprecate the align argument completely and introduce something like bintype (better names welcome). It would be quite confusing to deprecate align='mid' and introduce align='center' which has another effect. Moreover, this is not about alignment but about defining how the bins borders are determined.

@anntzer
Copy link
Contributor Author

anntzer commented Oct 14, 2018

'align' already exists for 'bar' essentially with that same meaning, there's also benefit in not changing that.

@ImportanceOfBeingErnest
Copy link
Member

Would it maybe make sense to plot a cumulative distribution with the bar on the right side of the bin edges (especially with rwidth != 1)? What happens if you want to plot such distribution on a log scale? Would using unequal bin sizes make sense in such case?

@timhoffm
Copy link
Member

Would it maybe make sense to plot a cumulative distribution with the bar on the right side of the bin edges (especially with rwidth != 1)?

You would want a CDF with right-aligned bars, however, you need quite strange input data to get a CDF from histogram binning. I think CDF should be plotted with bar().

timhoffm
timhoffm previously approved these changes Oct 14, 2018
Copy link
Member

@timhoffm timhoffm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Promoting this as 'better than before'. If we want to do a fundamental change to the naming/behavior can be discussed independently.

@anntzer
Copy link
Contributor Author

anntzer commented Oct 14, 2018

It's not totally clear to me why you'd want to use align='right' with cdfs, but even if you do (and more generally, if some kwargs only make sense in combination with some other kwargs, and give "questionable" results otherwise), I'd consider this mostly suggests that the functionality should be split into a separate function.
IOW APIs of the form "this kwarg value only makes sense in combination with that kwarg value" should be avoided.

@ImportanceOfBeingErnest
Copy link
Member

As an example:

image

import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt

a = np.random.exponential(3, size=300)

fig, axes = plt.subplots(nrows=3, sharex=True)
for align, ax in zip(["left", "mid", "right"], axes):    
    ax.hist(a, bins=range(int(a.max()+2)), cumulative=True, align=align, rwidth=0.3)
    ax.plot(sorted(a), np.arange(len(a)))
    ax.set_title(align)

plt.tight_layout()
plt.show()

Or, on a log scale with unequally sized bins,

image

fig, axes = plt.subplots(nrows=3, sharex=True)
for align, ax in zip(["left", "mid", "right"], axes):    
    ax.hist(a, bins=10**(np.linspace(np.log10(1e-9),np.log10(max(a)))), 
               cumulative=True, align=align, rwidth=0.3)
    ax.plot(sorted(a), np.arange(len(a)))
    ax.set_xscale("log")
    ax.set_title(align)
    ax.set_xlim(1e-2,None)

plt.tight_layout()
plt.show()

@timhoffm
Copy link
Member

Still, it’s quite unusual to have data in a way that a histogram of them would be a CDF. Histograms are naturally linked with PDFs (count the number of values in that interval). If you want hist to be a CDF, your hist input data would need to have n+m values in that interval, where n is the number of values of actual data you want to describe, and m is the number of values that are smaller than the lower interval border. May sound a bit strange, but that’s the point. Think about it. Histograms and CDFs are different pairs of shoes.

@jklymak
Copy link
Member

jklymak commented Oct 15, 2018

Agree w/ @timhoffm rant. Don't agree that CDF's shoudl be right-aligned either though, unless they are caclulated in some incorrect way. The value of the bar should be the sum of all occurences to the left of the right hand side of the bar.

@timhoffm
Copy link
Member

Let's please stop the discussion on CDFs here. It's leading off topic.

  1. I'm repeating myself. CDFs are not created via binning. They are irrelevant for hist().
  2. This particular PR does only deactivate a parameter combination, that hist() was not intended for (and that does currently produce only misleading results).

Happy to discuss possible API changes to hist() in a separate issue/PR.

@jklymak
Copy link
Member

jklymak commented Oct 15, 2018

.. anyhoo, this needs a cbook and a hist test. And an API change note, I think, since there may have been folks who were doing the other alignements on uneven bins....

@anntzer
Copy link
Contributor Author

anntzer commented Oct 15, 2018

As usual, feel free to push the test, I honestly don't think it adds much.

@jklymak
Copy link
Member

jklymak commented Oct 15, 2018

I guess we should discuss the test policy - at some point I thought we were talking about aiming for 100% coverage, but maybe I misunderstood. I agree these tests would be trivial, but OTOH, these tests would be trivial...

@ImportanceOfBeingErnest
Copy link
Member

Since this PR is an API change, we can discuss it here, not elsewhere.

Sorry for using the notion CDF. Of course a histogram never shows a CDF, but it can approximate the value of a CDF at certain positions (which is shown in the plots above). I don't think there is an error in the way cumulative=True calculates the histogram. Hence the plot above is valid. Or inversely, if there was a problem with cumulative=True the plots above would actually show it. But they are fine, as expected.
You may argue that one should rather use a bar plot in such a case - but with that argument you can also remove hist completely from matplotlib. You may also argue that the right edge of a bar should be at the bin edge, which makes often sense, but then why allow for the rwidth argument? We cannot judge on how people around the world depict their statistics results and I feel the above shows a valid use case.
In total, who gets hurt by keeping the align as it is? Is matplotlib becoming any better by removing it? There doesn't even seem to be any real world issue involved. If you need to play the math-police here, you may put a warning in, saying that the result may lead to an uncommon representation of a histogram.

The above applies to the bar-like histtypes. Concerning the step-like types, there actually is a problem, because the result is wrong. For this case, throwing an error is fine, as it prevents non-sensical output and thereby helps the user.

@timhoffm
Copy link
Member

@ImportanceOfBeingErnest sorry, I did not read your example carefully (wasn't aware of the cumulative=True parameter). In fact, your CDF example is valid.

@anntzer anntzer force-pushed the unequal-bins-align branch from 0e27266 to 66caed5 Compare October 17, 2018 13:03
@anntzer
Copy link
Contributor Author

anntzer commented Oct 17, 2018

Added test. Need to think more about the cdf discussion above...

@anntzer
Copy link
Contributor Author

anntzer commented Oct 19, 2018

Also xref'ing #11506 re: align+rwidth+log.

@jklymak
Copy link
Member

jklymak commented Oct 19, 2018

Marking WIP, feel free to clear if you are done thinking about CDFs, and go ahead and ping me and I'll review...

@jklymak jklymak modified the milestones: v3.1.0, v3.2.0 Feb 22, 2019
@tacaswell
Copy link
Member

Did the rebase, but then read through the discussion and see that there is still some concern over CDFs, pushing to 3.3.

@tacaswell tacaswell modified the milestones: v3.2.0, v3.3.0 Aug 25, 2019
Copy link
Member

@timhoffm timhoffm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revoke my approval of the current version. I'm not sure that it does not affact valid use cases.

@timhoffm timhoffm dismissed their stale review August 25, 2019 16:28

I'm not sure that this does not affect valid use cases.

@QuLogic QuLogic modified the milestones: v3.3.0, v3.4.0 May 2, 2020
@QuLogic QuLogic added the status: needs comment/discussion needs consensus on next step label May 2, 2020
@jklymak jklymak marked this pull request as draft July 23, 2020 16:07
@QuLogic QuLogic modified the milestones: v3.4.0, v3.5.0 Jan 21, 2021
@QuLogic QuLogic modified the milestones: v3.5.0, v3.6.0 Aug 21, 2021
@timhoffm timhoffm modified the milestones: v3.6.0, unassigned Apr 30, 2022
@story645 story645 modified the milestones: unassigned, needs sorting Oct 6, 2022
@github-actions
Copy link

Since this Pull Request has not been updated in 60 days, it has been marked "inactive." This does not mean that it will be closed, though it may be moved to a "Draft" state. This helps maintainers prioritize their reviewing efforts. You can pick the PR back up anytime - please ping us if you need a review or guidance to move the PR forward! If you do not plan on continuing the work, please let us know so that we can either find someone to take the PR over, or close it.

@github-actions github-actions bot added the status: inactive Marked by the “Stale” Github Action label May 19, 2023
@anntzer
Copy link
Contributor Author

anntzer commented May 21, 2023

Let's just track this at #12494.

@anntzer anntzer closed this May 21, 2023
@anntzer anntzer deleted the unequal-bins-align branch May 21, 2023 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants