Skip to content

[Bug]: plt.hist with bin='auto' struggles with levy-distributed points #23524

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mdhaber opened this issue Jul 30, 2022 · 12 comments
Closed

[Bug]: plt.hist with bin='auto' struggles with levy-distributed points #23524

mdhaber opened this issue Jul 30, 2022 · 12 comments

Comments

@mdhaber
Copy link

mdhaber commented Jul 30, 2022

Bug summary

plt.hist with bin='auto' struggles when the data has a long tail. This was discovered in the context of levy- and levy_l- distributed data (see scipy/scipy#14563), but I've included a pathological example below.

Code for reproduction

# natural example
from scipy.stats import levy_l
import matplotlib.pyplot as plt
r = levy_l().rvs(size=1000)
plt.hist(r, bins='auto')

# pathological example
plt.hist([1, 2, 3, 4, 1e6], bins='auto')

Actual outcome

Very-long execution, hanging, or crashing depending on the specific example. Once I got a memory error, I think.

Expected outcome

A plot

Additional information

See scipy/scipy#14563

Operating system

Windows

Matplotlib Version

3.5.2

Matplotlib Backend

module://matplotlib_inline.backend_inline

Python version

3.10.5

Jupyter version

NA

Installation

conda

@mdhaber mdhaber changed the title [Bug]: plt.hist with bin='auto' struggles with levy`-distributed points [Bug]: plt.hist with bin='auto' struggles with levy-distributed points Jul 30, 2022
@jklymak
Copy link
Member

jklymak commented Jul 30, 2022

bins='auto' just gets passed to numpy's histogram https://numpy.org/doc/stable/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges
If that returns a million bins you are going to have trouble plotting them.

I have no idea what a levy distribution is, but I assume its best not histogrammed with linear bins, so you probably need to pass it a set fixed of nonlinearly spaced bins, or transform the data to something more linear and then histogram it.

For a non-sparse example:

fig, ax = plt.subplots()
ax.hist(10**(np.random.randn(1000)*2), bins='auto')
plt.show()

brings my machine to halt and clearly shouldn't be binned with linear bins.

@mdhaber
Copy link
Author

mdhaber commented Jul 30, 2022

Please see scipy/scipy#14563 for context.

@jklymak
Copy link
Member

jklymak commented Jul 30, 2022

Sure, but what can matplotlib do to help here?

@mdhaber
Copy link
Author

mdhaber commented Jul 30, 2022

I'm not sure, but @tacaswell may have something in mind. I posted at his request.

@tacaswell
Copy link
Member

There have been requests to make 'auto' our default (in fact the work on auto in numpy started as a Matplotlib PR that we asked the contributor to move upstream!).

Either this is something we should have a check to warn / fail on,

This pathological example is failing because it has ~427k bins

In [11]: list(map(len, np.histogram([1, 2, 3, 4, 1e6], bins='auto')))
Out[11]: [427494, 427495]

which means in the default mode of ax.hist we are trying to draw 427k rectangles, most of which are 0 height! One of the bottlenecks is just making that many Rectangle instances (it takes ~minutes and 4.7G of ram) and it will render (slowly), but because the bins are a tiny fraction of a pixel wide....they are invisible:

test

The options I see:

  1. do nothing, if the user asks for auto, they get auto and can wait
  2. add a warning if the number of bins produced by auto (or probably passed in by the user) is >10k we issue a user visible warning saying "you have asked to draw more bins than we suspect you have pixels for. We are going to try but you have been warned!"
  3. add an exception and raise if we can detect it is probably going to go badly

I'm currently between 1 and 2 (and maybe a bit closer to 2).

I also think that this means making auto the default of hist is a non-starter. We could add more layers of heuristics on top to try the different auto algorithms until one has "few enough" bins. I think we would want to push that back to numpy to add 'auto-min-bins' or some way of constraining the algorithms.

@jklymak
Copy link
Member

jklymak commented Jul 31, 2022

I guess a 4th option is use "stair" by default if there are over X bins, where X is something like the screen resolution?

@timhoffm
Copy link
Member

I'm currently between 1 and 2 (and maybe a bit closer to 2).

I'd be in favor of a warning. Using that many bins is almost never a good solution, and if somebody really needs it, they should really not try to dray individual rectangles. - Maybe don't hard-code to 10k but to N times number of horizontal pixels. e.g. N = 3, which is clearly oversampling.

I guess a 4th option is use "stair" by default if there are over X bins, where X is something like the screen resolution?

I have some concerns with dynamically switching the output type depending on inputs and screen size.

I guess a 5th option would be to change the default to use filled stairs. - Basically we made stairs to be able to easily draw np.histogram outputs. Should we not use it internally as well?
While this is an API change and we need a deprecation cycle, I doubt many people mess with the individual bar rectangles (I cannot imagine a reasonalbe use-case).

@tacaswell
Copy link
Member

Maybe don't hard-code to 10k but to N times number of horizontal pixels. e.g. N = 3, which is clearly oversampling.

Unfortunately we need to know this at artist creation time when we may not yet know the actual size of the output yet.

For reasons I do not understand yet

ax.hist([1, 2, 3, 4, 1e6], bins='auto', histtype='step')

is still slow (but not as slow) to create the artist. I suspect we are doing some loops in Python that we could be vectorizing.
On the other hand

ax.stairs(*np.histogram([1, 2, 3, 4, 1e6], bins='auto'))

is quick (and because it is drawing the edge is actually visible as the edge is fixed width).

I thought we did not go with option 5 when we added the stairs is because the return types are too different?

@jklymak
Copy link
Member

jklymak commented Jul 31, 2022

Yeah, I think making stairs the default will be hard as it will break a lot of people.

There are two (at least) problems

  1. sometimes bars are too thin to be plotted
  2. plotting many bars is really slow.

The first seems a bit weird to me, verging on a bug. I feel like this has been discussed before, but why does a thin bar not have any visual output?

For the case where 1 and 2 are a problem, say for nbins=10k or 100k, warning and using stair is in my opinion a reasonable thing to do that is back-compatible because it just broke gracelessly before, and better than just warning and then hanging or throwing an out-of-memory error.

@timhoffm
Copy link
Member

I suspect we are doing some loops in Python that we could be vectorizing.

probably

for m in tops:

and

for x, y, c in reversed(list(zip(xvals, yvals, color))):

I thought we did not go with option 5 when we added the stairs is because the return types are too different?

ARAICS we did not discuss this. hist() is only mentioned a few times in the PR #18275 and the related issuse. I suppose we focuessed primarily on getting the API and underlying Artist itself right.

No matter what, we should add two new variants to plt.hist(..., histtype=...) named 'stairs' and 'stairs-filled' (or 'filled-stairs'). It's reasonalbe to provide users with that functionality in hist(), and it's the fast one.

We can separately decide whether we want to auto-switch for large bins, or maybe change the default later on.

@anntzer
Copy link
Contributor

anntzer commented Aug 1, 2022

See also #12493 re: default artist type. Back then there was only step, not stairs, but the idea is basically the same.

@mwaskom
Copy link

mwaskom commented Aug 1, 2022

More context in which the problem of "auto" producing catastrophically small bin widths is discussed:

Numpy seemed open to adding a bins="auto-capped" (or whatever) option, but momentum stalled.

@mdhaber mdhaber closed this as not planned Won't fix, can't repro, duplicate, stale Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants