FIX: if bins input to hist is str, treat like no bins #8638

tacaswell · 2017-05-18T02:03:47Z

PR Summary

This change causes the range of all data sets to be computed and
passed to numpy (which in turn uses the total range to compute the
'best' bins).

The existing code 'latches' the bins from the first data set to use
for the rest, so this can still lead to poor binning (if the data
sets are widely different).
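
To make the remaining limitation concrete, here is a minimal NumPy-only sketch. It is not the Matplotlib code itself, and the two sample arrays are made up for illustration. With a string `bins`, passing the total range makes the edges span both data sets, but the bin width is still estimated from the first data set alone.

```python
import numpy as np

# Two made-up data sets with very different locations and spreads.
first = np.linspace(0.0, 1.0, 100)
second = np.linspace(50.0, 100.0, 100)

# Total range over both data sets.
total_range = (min(first.min(), second.min()),
               max(first.max(), second.max()))

# 'auto' edges for the first data set over its own range only.
_, edges_own = np.histogram(first, bins='auto')
# 'auto' edges for the first data set over the total range: the span is
# right, but the bin width is still estimated from the first data set.
_, edges_total = np.histogram(first, bins='auto', range=total_range)

print("own range:  ", edges_own[0], edges_own[-1], len(edges_own) - 1)
print("total range:", edges_total[0], edges_total[-1], len(edges_total) - 1)
```

If these edges are then reused for `second`, the bins cover its values, although nothing about its distribution went into choosing them.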

closes #8636

This still needs some tests and possibly more thorough consideration of how to deal with the 'string' bins and multiple data sets.

Computing the histogram once with all of the data and then using those bins may be a better option (but involves computing the histograms twice).

PR Checklist

Has Pytest style unit tests
Code is PEP 8 compliant

anntzer · 2017-06-10T05:08:45Z

Computing the histogram once with all of the data and then using those bins may be a better option (but involves computing the histograms twice).

... unless you're willing to look up the bin selector in the private np.lib.function_base._hist_bin_selectors.

tacaswell · 2017-06-12T03:48:14Z

Even if you are willing to look up the selector, you still have to do the histogram once with all of the data and once for each data set.

anntzer · 2017-06-12T04:39:18Z

No, you only want to compute the bins once on the total dataset, and then histogram each individual dataset using the given bins, right?
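
A minimal sketch of that idea, assuming NumPy 1.15 or later, where the public `np.histogram_bin_edges` is available. The data values are made up for illustration, and this is not the code in the PR.

```python
import numpy as np

# Two made-up data sets.
a = np.array([1.0, 2.0, 2.5, 3.0])
b = np.array([4.0, 5.0, 6.0, 7.0, 9.0])

# Choose the edges once, from the combined data ...
edges = np.histogram_bin_edges(np.concatenate([a, b]), bins='auto')

# ... then histogram each data set against those fixed edges.
counts_a, _ = np.histogram(a, bins=edges)
counts_b, _ = np.histogram(b, bins=edges)

print(edges)
print(counts_a, counts_b)
```

The bin estimator runs only once this way, and every data set shares the same edges, so the per-data-set counts are directly comparable.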

eric-wieser · 2017-12-13T05:54:15Z

Computing the histogram once with all of the data and then using those bins may be a better option (but involves computing the histograms twice).

I think that this might be an acceptable cost. I'm hoping to get things fixed in numpy in 1.15, but it'll be a while before matplotlib's minimum version can rely on it.

Wouldn't it be better to have a 2x slowdown in exchange for correctness?

I suppose you could backport the histogram functions from 1.15.

tacaswell · 2017-12-13T18:53:44Z

Ah, I now understand what @anntzer is saying about using the selector. I am pretty 👎 on touching numpy internals and checking for something that does not exist in all versions of numpy that we support. Open to vendoring the selectors (despite them being in numpy because we bounced a feature request from Matplotlib to numpy), but probably not the full histogram code (we just about finished getting rid of a partial backport of histogram!).

eric-wieser · 2017-12-13T19:07:56Z

I am pretty 👎 on touching numpy internals

As you should be, because they're about to move

Open to vendoring the selectors

If 1.15 adds histogram_edges, then you could vendor that function

anntzer · 2018-07-25T17:03:15Z

numpy 1.15 is out, now we just need to bump the dependency to it :)

tacaswell · 2019-02-24T20:55:54Z

rebased and added a test, still not sure if this is the right thing.

On one hand, it fixes a problem (when people pass in more than one distribution and bins as a string, it was running the bin generation algorithm considering only the first dataset's limits) but only part way (it now uses the correct limits, but only considers the distribution of the first dataset).

eric-wieser · 2019-02-24T21:54:27Z

lib/matplotlib/axes/_axes.py

-        binsgiven = np.iterable(bins) or bin_range is not None
+        binsgiven = ((np.iterable(bins) and
+                      not isinstance(bins, str)) or
+                     bin_range is not None)


It might be worth matching the numpy logic here, as:

if bins is None:
    binsgiven = False
else:
    # match the logic in numpy.lib.histograms._get_bin_edges
    if isinstance(bins, str):
        binsgiven = False
    elif np.ndim(bins) == 1:
        binsgiven = True
    else:
        binsgiven = False

Renaming to bins_array_given would make sense too.

eric-wieser · 2019-02-24T22:24:52Z

Would something like this be a better compromise?

try:
    from numpy.lib.histograms import histogram_bin_edges
except ImportError:
    def histogram_bin_edges(arr, bins, range=None, weights=None):
        if isinstance(bins, str):
            # rather than backporting the internals, just do the full computation.
            # If this is too slow for users, they can update numpy, or pick a manual number of bins
            # (weights is passed by keyword: it is not the fourth positional argument of np.histogram)
            return np.histogram(arr, bins, range, weights=weights)[1]
        elif np.ndim(bins) == 0:
            # not strictly the behavior of histogram_bin_edges, but easier to not bother with the linspace here
            return bins
        else:
            return bins

Then call

bins = histogram_bin_edges(arr.ravel(), bins, range, weights)

tacaswell · 2019-02-25T02:05:44Z

bins = histogram_bin_edges(arr.ravel(), bins, range, weights)

Except that we are not guaranteed that the input is a 2D numpy array (or even equal length). Would probably have to do something like hstack to be safe....

Applied the in-line comments though.

eric-wieser · 2019-02-25T02:23:17Z

lib/matplotlib/axes/_axes.py

@@ -6666,7 +6675,7 @@ def hist(self, x, bins=None, range=None, density=None, weights=None,
        # If bins are not specified either explicitly or via range,
        # we need to figure out the range required for all datasets,
        # and supply that to np.histogram.
-        if not binsgiven and not input_empty:
+        if not bins_array_given and not input_empty:


Think you now need "and bin_range is None" here, which makes more sense anyway

eric-wieser · 2019-02-25T02:24:54Z

Would probably have to do something like hstack to be safe

I think concatenate would be more appropriate here. Was hoping for an approach that wouldn't make extra copies, but I'm not sure if that's a worthwhile or avoidable concern anyway.
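
A rough sketch of the concatenation step for input that is not a rectangular array. The helper name `flatten_datasets` is hypothetical and only illustrates the idea. It does make a copy of the data, which is the cost mentioned above.

```python
import numpy as np

def flatten_datasets(x):
    """Hypothetical helper: join all data sets into one 1-D array.

    Works for a 2-D array as well as for a list of sequences with
    different lengths, because each data set is flattened on its own
    before being joined.
    """
    return np.concatenate([np.ravel(xi) for xi in x])

# Jagged input: the data sets have different lengths.
jagged = [[1, 2, 3], [4, 5], [6]]
flat = flatten_datasets(jagged)
print(flat)

# Rectangular input goes through the same path.
rect = np.array([[1.0, 2.0], [3.0, 4.0]])
print(flatten_datasets(rect))
```

The result only feeds the bin estimator, so the order of the joined values does not matter.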

eric-wieser · 2019-02-25T03:11:11Z

lib/matplotlib/tests/test_axes.py

+
+
+def test_hist_auto_bins():
+    _, bins, _ = plt.hist([[1, 2, 3], [4, 5, 6]], bins='auto')


A test on jagged input might be nice too

eric-wieser · 2019-02-25T03:11:41Z

PR description is pretty outdated after the title change

anntzer · 2019-02-25T05:35:50Z

lib/matplotlib/axes/_axes.py

@@ -6637,7 +6637,16 @@ def hist(self, x, bins=None, range=None, density=None, weights=None,
            bin_range = self.convert_xunits(bin_range)

        # Check whether bins or range are given explicitly.
-        binsgiven = np.iterable(bins) or bin_range is not None
+        if bins is None:


The whole thing is just bins_array_given = np.ndim(bins) == 1? (perhaps with additional comments indicating that this excludes str and None).
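
As a quick check of that claim, the sketch below compares the explicit branches with the one-line `np.ndim` test on a few representative inputs. Both function names are illustrative only.

```python
import numpy as np

def bins_array_given_verbose(bins):
    # Explicit form: None and strings never count as bin edges.
    if bins is None:
        return False
    if isinstance(bins, str):
        return False
    return np.ndim(bins) == 1

def bins_array_given(bins):
    # np.ndim is 0 for None, for any str, and for a scalar bin count,
    # so only a 1-D sequence of edges passes this test.
    return np.ndim(bins) == 1

candidates = [None, 'auto', 10, [0, 1, 2], np.array([0.0, 0.5, 1.0])]
results = [(bins_array_given_verbose(b), bins_array_given(b))
           for b in candidates]
print(results)
```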

anntzer · 2019-02-25T05:38:18Z

I agree with concatenating everything and passing down to numpy. I doubt that's even close to being a bottleneck (and anyways users can pass in their own bins if they really care).

eric-wieser · 2019-02-26T04:30:04Z

lib/matplotlib/axes/_axes.py

-                if len(xi) > 0:
-                    xmin = min(xmin, np.nanmin(xi))
-                    xmax = max(xmax, np.nanmax(xi))
-            bin_range = (xmin, xmax)


You either need to keep this code for the case when bins is an integer, or you need to add a case for integral bins in the histogram_bin_edges backport

ah, good catch on that, I fell down a different rabbit hole (see my comment at the issue level)...
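
For the integer case, the idea can be sketched as follows. The helper name `integer_bin_edges` is hypothetical. It reuses the min/max loop from the removed lines and then builds equal-width edges, and it does not handle the case where every data set is empty.

```python
import numpy as np

def integer_bin_edges(datasets, bins=10):
    """Hypothetical helper: equal-width edges spanning every data set."""
    xmin, xmax = np.inf, -np.inf
    for xi in datasets:
        # Skip empty data sets, as the removed code did.
        if len(xi) > 0:
            xmin = min(xmin, np.nanmin(xi))
            xmax = max(xmax, np.nanmax(xi))
    # An integer bin count means bins + 1 evenly spaced edges.
    return np.linspace(xmin, xmax, bins + 1)

edges = integer_bin_edges([[1, 2, 3], [], [4, 5, 6]], bins=5)
print(edges)
```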

eric-wieser · 2019-03-19T02:45:47Z

lib/matplotlib/axes/_axes.py

+                # hard-code numpy's default
+                bins = 10
+            if range is None:
+                range = np.nanmin(arr), np.manmax(arr)


Note that numpy does not use the nan variants here.

Agreed that if we want to support nans, we should pre-remove them ourselves before passing them to histogram_bin_edges.

... also note the typo here....

Just took me about 15 minutes to sort out what the typo was....
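
The NaN pre-removal suggested in this thread could look roughly like this. The helper name `finite_bin_edges` is hypothetical, and `np.histogram_bin_edges` assumes NumPy 1.15 or later.

```python
import numpy as np

def finite_bin_edges(arr, bins='auto'):
    """Hypothetical helper: drop NaNs, then let NumPy choose the edges."""
    arr = np.asarray(arr, dtype=float).ravel()
    # Remove NaNs ourselves rather than relying on nan-aware min/max.
    arr = arr[~np.isnan(arr)]
    return np.histogram_bin_edges(arr, bins=bins)

edges = finite_bin_edges([1.0, np.nan, 2.0, 5.0, np.nan])
print(edges)
```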

tacaswell · 2019-03-19T12:52:30Z

The test failure is the svg fix from #13710.

As suggested by @eric-wieser, we no longer track whether the bins kwarg was passed, but whether what was passed is an array that we should use as the bin edges. This simplifies some internal logic.

timhoffm

Subject to CI pass.

tacaswell · 2019-04-01T22:52:52Z

Squashed this down to 4 commits.

Given the way this is going it is not going to backport cleanly....

Backport PR #8638 on branch v3.1.x (FIX: if bins input to hist is str, treat like no bins)

tacaswell added this to the 2.1 (next point release) milestone May 18, 2017

afvincent mentioned this pull request Jun 20, 2017

Histogram range appears to be calculated incorrectly when using "bins='auto'" with 2D data #8778

Closed

tacaswell modified the milestones: 2.1 (next point release), 2.2 (next next feature release) Aug 29, 2017

tacaswell modified the milestones: needs sorting, v3.1 Jul 25, 2018

This was referenced Oct 19, 2018

ENH: When histogramming data with integer dtype, force bin width >= 1. numpy/numpy#12150

Merged

Dropping support for Py3.5 and numpy 1.10 #12358

Closed

tacaswell force-pushed the fix_auto_hist_range branch from cc55b33 to ab54089 on February 24, 2019 20:53

eric-wieser reviewed Feb 24, 2019

eric-wieser reviewed Feb 25, 2019

eric-wieser approved these changes Feb 25, 2019

anntzer reviewed Feb 25, 2019

eric-wieser reviewed Feb 26, 2019

tacaswell force-pushed the fix_auto_hist_range branch from 1f2de1b to 56c353f on March 19, 2019 02:42

eric-wieser reviewed Mar 19, 2019

tacaswell closed this Mar 25, 2019

tacaswell reopened this Mar 25, 2019

tacaswell force-pushed the fix_auto_hist_range branch from 503012d to 388a209 on March 31, 2019 18:58

tacaswell mentioned this pull request Mar 31, 2019

[WIP] Masking invalid x and/or weights in hist (#6483) #7133

Closed

anntzer reviewed Mar 31, 2019

anntzer reviewed Mar 31, 2019