-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
FIX: if bins input to hist is str, treat like no bins #8638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
... unless you're willing to lookup the bin selector into the private |
Even if you are willing to lookup the selector, you still have to do the histogram once with all of the data and once for each data set. |
No, you only want to compute the bins once on the total dataset, and then histogram each individual dataset using the given bins, right? |
I think that this might be an acceptable cost. I'm hoping to get things fixed in numpy in 1.15, but it'll be a while before matplotlib's minimum version can rely on it. Wouldn't it be better to have a 2x slowdown in exchange for correctness? I suppose you could backport the histrogram functions from 1.15. |
Ah, I now understand what @anntzer is saying about using the selector. I am pretty 👎 on touching numpy internals and checking for something that does not exist in all versions of numpy that we support. Open to vendoring the selectors (despite them being in numpy because we bounced a feature request from Matplotlib to numpy), but probably not the full histogram code (we just about finished getting rid of a partial backport of histogram!). |
As you should be, because they're about to move
If 1.15 adds |
numpy 1.15 is out, now we just need to bump the dependency to it :) |
cc55b33
to
ab54089
Compare
rebased and added a test, still not sure if this is the right thing. On one hand, it fixes a problem (when people pass in more than on distribution and bins as a string it was running the bin generation algorithm only considering the first dataset's limits) but only part way (it now uses the correct limits, but only considers the distribution of the first dataset). |
lib/matplotlib/axes/_axes.py
Outdated
binsgiven = np.iterable(bins) or bin_range is not None | ||
binsgiven = ((np.iterable(bins) and | ||
not isinstance(bins, str)) or | ||
bin_range is not None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth matching the numpy logic here, as:
if bins is None:
binsgiven = False
else:
# match the logic in numpy.lib.histograms._get_bin_edges
if isinstance(bins, basestring):
binsgiven = False
elif np.ndim(bins) == 1:
binsgiven = True
else:
binsgiven = False
Renaming to bins_array_given
would make sense too.
Would something like this be a better compromise? try:
from numpy.lib.histograms import histogram_bin_edges
except ImportError:
def histogram_bin_edges(arr, bins, range=None, weights=None):
if isinstance(bins, basestring):
# rather than backporting the internals, just do the full computation.
# If this is too slow for users, they can update numpy, or pick a manual number of bins
return np.histogram(arr, bins, range, weights)[1]
elif np.ndim(bins) == 0:
# not strictly the behavior of histogram_bin_edges, but easier to not bother with the linspace here
return bins
else:
return bins Then call
|
Except that we are not guaranteed that the input is a 2D numpy array (or even equal length). Would probably have to do something like Applied the in-line comments though. |
lib/matplotlib/axes/_axes.py
Outdated
@@ -6666,7 +6675,7 @@ def hist(self, x, bins=None, range=None, density=None, weights=None, | |||
# If bins are not specified either explicitly or via range, | |||
# we need to figure out the range required for all datasets, | |||
# and supply that to np.histogram. | |||
if not binsgiven and not input_empty: | |||
if not bins_array_given and not input_empty: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think you now need and bin_range is None
here, which makes more sense anyway
I think |
lib/matplotlib/tests/test_axes.py
Outdated
|
||
|
||
def test_hist_auto_bins(): | ||
_, bins, _ = plt.hist([[1, 2, 3], [4, 5, 6]], bins='auto') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A test on jagged input might be nice too
PR description is pretty outdated after the title change |
lib/matplotlib/axes/_axes.py
Outdated
@@ -6637,7 +6637,16 @@ def hist(self, x, bins=None, range=None, density=None, weights=None, | |||
bin_range = self.convert_xunits(bin_range) | |||
|
|||
# Check whether bins or range are given explicitly. | |||
binsgiven = np.iterable(bins) or bin_range is not None | |||
if bins is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole thing is just bins_array_given = np.ndim(bins) == 1
? (perhaps with additional comments indicating that this excludes str and None).
I agree with concatenating everything and passing down to numpy. I doubt that's even close to being a bottleneck (and anyways users can pass in their own bins if they really care). |
if len(xi) > 0: | ||
xmin = min(xmin, np.nanmin(xi)) | ||
xmax = max(xmax, np.nanmax(xi)) | ||
bin_range = (xmin, xmax) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You either need to keep this code for the case when bins
is an integer, or you need to add a case for integral bins in the histogram_bin_edges
backport
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, good catch on that, I fell down a different rabbit hole (see my comment at the issue level)...
1f2de1b
to
56c353f
Compare
lib/matplotlib/axes/_axes.py
Outdated
# hard-code numpy's default | ||
bins = 10 | ||
if range is None: | ||
range = np.nanmin(arr), np.manmax(arr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that numpy does not use the nan variants here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that if we want to support nans, we should pre-remove them ourselves before passing them to histogram_bin_edges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... also note the typo here....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just took me about 15 minutes to sort out what the typo was....
The test failure is the svg fix from #13710 . |
This change causes the range of all data sets to be computed and passed to numpy (which in turn uses the total range to compute the 'best' bins). The existing code 'latches' the bins from the first data set to use for the rest so this is can still lead to poor binning (if the data sets are widely different). closes matplotlib#8636
503012d
to
388a209
Compare
fbdf09b
to
a1cad69
Compare
as suggested by @eric-wieser We are no longer tracking if the bins kwarg was passed, but if it was passed in is an array we should use as the bin edges. Simplify some internal logic.
a1cad69
to
c6f0554
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Subject to CI pass.
Squashed this down to 4 commits. Given the way this is going it is not going to backport cleanly.... |
…38-on-v3.1.x Backport PR #8638 on branch v3.1.x (FIX: if bins input to hist is str, treat like no bins)
PR Summary
This change causes the range of all data sets to be computed and
passed to numpy (which in turn uses the total range to compute the
'best' bins).
The existing code 'latches' the bins from the first data set to use
for the rest so this is can still lead to poor binning (if the data
sets are widely different).
closes #8636
This still needs some tests and possibly a more through thinking of how to deal with the 'string' bins and multiple data sets.
Computing the histogram once with all of the data and the using those bins may be a better option (but involves computing the histograms twice).
PR Checklist