[Bug]: plt.hist with bin='auto' struggles with levy-distributed points #23524
Comments
bins='auto' just gets passed to numpy's histogram: https://numpy.org/doc/stable/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges
I have no idea what a levy distribution is, but I assume it's best not histogrammed with linear bins, so you probably need to pass it a fixed set of nonlinearly spaced bins, or transform the data to something more linear and then histogram it. For a non-sparse example:
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
ax.hist(10**(np.random.randn(1000)*2), bins='auto')
plt.show()
brings my machine to a halt and clearly shouldn't be binned with linear bins.
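One way to follow the suggestion of nonlinearly spaced bins is to build logarithmically spaced edges and hand them to the histogram call. The sketch below is only an illustration: the seed, the choice of 30 bins, and the variable names are assumptions, not something proposed in the thread.

```python
import numpy as np

# Assumed sample: the same heavy-tailed shape as the example above,
# drawn from a seeded generator so the run is repeatable.
rng = np.random.default_rng(0)
data = 10 ** (rng.standard_normal(1000) * 2)

# 31 edges give 30 bins of equal width in log space (the count is arbitrary).
edges = np.geomspace(data.min(), data.max(), 31)

# Every sample falls inside [edges[0], edges[-1]], so nothing is dropped.
counts, edges = np.histogram(data, bins=edges)
print(len(counts), counts.sum())
```

The same `edges` array can be passed as `bins=edges` to `ax.hist`, typically together with `ax.set_xscale('log')` so the bars appear evenly wide.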
Please see scipy/scipy#14563 for context.
Sure, but what can matplotlib do to help here?
I'm not sure, but @tacaswell may have something in mind. I posted at his request.
There have been requests to make
Either this is something we should have a check to warn / fail on,
This pathological example is failing because it has ~427k bins
In [11]: list(map(len, np.histogram([1, 2, 3, 4, 1e6], bins='auto')))
Out[11]: [427494, 427495]
which means in the default mode of
The options I see:
I'm currently between 1 and 2 (and maybe a bit closer to 2). I also think that this means making
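To see where a bin count of that size can come from, the two textbook rules that NumPy's documentation describes 'auto' as choosing between (Freedman-Diaconis and Sturges) can be evaluated by hand. This is a rough sketch, not NumPy's actual implementation, and the result of `bins='auto'` itself may differ between NumPy versions.

```python
import math

import numpy as np

data = np.array([1, 2, 3, 4, 1e6])
n = data.size
span = data.max() - data.min()

# Freedman-Diaconis rule: width = 2 * IQR / n**(1/3).
# The IQR ignores the outlier, so the width stays tiny.
q25, q75 = np.percentile(data, [25, 75])
fd_width = 2 * (q75 - q25) / n ** (1 / 3)

# Sturges rule: width = range / (log2(n) + 1).
sturges_width = span / (math.log2(n) + 1)

# Number of bins each rule implies over the full data range.
fd_bins = math.ceil(span / fd_width)
sturges_bins = math.ceil(span / sturges_width)
print(fd_bins, sturges_bins)
```

Taking the narrower of the two widths means the Freedman-Diaconis estimate wins here: a single far outlier stretches the range while the interquartile range stays small.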
I guess a 4th option is use "stair" by default if there are over X bins, where X is something like the screen resolution?
I'd be in favor of a warning. Using that many bins is almost never a good solution, and if somebody really needs it, they should really not try to draw individual rectangles. - Maybe don't hard-code to 10k but to N times the number of horizontal pixels, e.g. N = 3, which is clearly oversampling.
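The warning idea above could look roughly like the following. The helper name `warn_if_too_many_bins`, its parameters, and the warning class are hypothetical and not part of Matplotlib; the factor of 3 is the oversampling value floated in the comment.

```python
import warnings


def warn_if_too_many_bins(n_bins, width_px, oversample=3):
    """Hypothetical check: warn when bins exceed oversample * pixel width.

    Returns True if a warning was issued, False otherwise.
    """
    limit = oversample * width_px
    if n_bins > limit:
        warnings.warn(
            f"{n_bins} bins exceed {limit} "
            f"({oversample} x {width_px} horizontal pixels); "
            "drawing one rectangle per bin is likely to be very slow.",
            RuntimeWarning,
            stacklevel=2,
        )
        return True
    return False


# Example: the ~427k-bin case from the thread on an assumed 640 px wide axes.
warn_if_too_many_bins(427_494, 640)
```

As a later comment points out, the output size may not be known when the artist is created, so a real check would need some fallback for `width_px`.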
I have some concerns with dynamically switching the output type depending on inputs and screen size. I guess a 5th option would be to change the default to use filled stairs. - Basically we made
Unfortunately we need to know this at artist creation time, when we may not yet know the actual size of the output. For reasons I do not understand yet, ax.hist([1, 2, 3, 4, 1e6], bins='auto', histtype='step') is still slow (but not as slow) to create the artist. I suspect we are doing some loops in Python that we could be vectorizing. ax.stairs(*np.histogram([1, 2, 3, 4, 1e6], bins='auto')) is quick (and because it is drawing the edge, the result is actually visible, as the edge is fixed width). I thought we did not go with option 5 when we added the
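The stairs route mentioned above can be tried in isolation. The sketch below assumes a headless Agg backend and passes the bin count from the quoted session explicitly, so that it does not depend on how a given NumPy version resolves 'auto'; `fill=True` is used only to mimic a filled histogram.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless

import matplotlib.pyplot as plt
import numpy as np

data = [1, 2, 3, 4, 1e6]

# Explicit bin count taken from the session quoted in the thread.
counts, edges = np.histogram(data, bins=427_494)

fig, ax = plt.subplots()
# A single StepPatch artist is created instead of one Rectangle per bin.
patch = ax.stairs(counts, edges, fill=True)
print(type(patch).__name__, len(counts))
plt.close(fig)
```

Because the whole histogram is one artist, creation cost does not grow with a per-bin Python object the way the default bar-style output does.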
Yeah, I think making
There are two (at least) problems
The first seems a bit weird to me, verging on a bug. I feel like this has been discussed before, but why does a thin bar not have any visual output? For the case where 1 and 2 are a problem, say for nbins=10k or 100k, warning and using stair is in my opinion a reasonable thing to do. It is back-compatible because it just broke gracelessly before, and it is better than just warning and then hanging or throwing an out-of-memory error.
probably matplotlib/lib/matplotlib/axes/_axes.py Line 6744 in 931a81f
and matplotlib/lib/matplotlib/axes/_axes.py Line 6769 in 931a81f
AFAICS we did not discuss this. No matter what, we should add two new variants to
We can separately decide whether we want to auto-switch for large bins, or maybe change the default later on.
See also #12493 re: default artist type. Back then there was only step, not stairs, but the idea is basically the same.
More context in which the problem of "auto" producing catastrophically small bin widths is discussed:
Numpy seemed open to adding a
Bug summary
plt.hist with bins='auto' struggles when the data has a long tail. This was discovered in the context of levy- and levy_l-distributed data (see scipy/scipy#14563), but I've included a pathological example below.
Code for reproduction
Actual outcome
Very long execution, hanging, or crashing, depending on the specific example. I think I got a memory error once.
Expected outcome
A plot
Additional information
See scipy/scipy#14563
Operating system
Windows
Matplotlib Version
3.5.2
Matplotlib Backend
module://matplotlib_inline.backend_inline
Python version
3.10.5
Jupyter version
NA
Installation
conda