Skip to content

FIX: if bins input to hist is str, treat like no bins #8638

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 2, 2019

Conversation

tacaswell
Copy link
Member

PR Summary

This change causes the range of all data sets to be computed and
passed to numpy (which in turn uses the total range to compute the
'best' bins).

The existing code 'latches' the bins from the first data set to use
for the rest so this is can still lead to poor binning (if the data
sets are widely different).

closes #8636

This still needs some tests and possibly a more through thinking of how to deal with the 'string' bins and multiple data sets.

Computing the histogram once with all of the data and the using those bins may be a better option (but involves computing the histograms twice).

PR Checklist

  • Has Pytest style unit tests
  • Code is PEP 8 compliant

@tacaswell tacaswell added this to the 2.1 (next point release) milestone May 18, 2017
@anntzer
Copy link
Contributor

anntzer commented Jun 10, 2017

Computing the histogram once with all of the data and the using those bins may be a better option (but involves computing the histograms twice).

... unless you're willing to lookup the bin selector into the private np.lib.function_base._hist_bin_selectors.

@tacaswell
Copy link
Member Author

Even if you are willing to lookup the selector, you still have to do the histogram once with all of the data and once for each data set.

@anntzer
Copy link
Contributor

anntzer commented Jun 12, 2017

No, you only want to compute the bins once on the total dataset, and then histogram each individual dataset using the given bins, right?

@tacaswell tacaswell modified the milestones: 2.1 (next point release), 2.2 (next next feature release) Aug 29, 2017
@eric-wieser
Copy link
Contributor

Computing the histogram once with all of the data and the using those bins may be a better option (but involves computing the histograms twice).

I think that this might be an acceptable cost. I'm hoping to get things fixed in numpy in 1.15, but it'll be a while before matplotlib's minimum version can rely on it.

Wouldn't it be better to have a 2x slowdown in exchange for correctness?

I suppose you could backport the histrogram functions from 1.15.

@tacaswell
Copy link
Member Author

Ah, I now understand what @anntzer is saying about using the selector. I am pretty 👎 on touching numpy internals and checking for something that does not exist in all versions of numpy that we support. Open to vendoring the selectors (despite them being in numpy because we bounced a feature request from Matplotlib to numpy), but probably not the full histogram code (we just about finished getting rid of a partial backport of histogram!).

@eric-wieser
Copy link
Contributor

I am pretty 👎 on touching numpy internals

As you should be, because they're about to move

Open to vendoring the selectors

If 1.15 adds histogram_edges, then you could vendor that function

@anntzer
Copy link
Contributor

anntzer commented Jul 25, 2018

numpy 1.15 is out, now we just need to bump the dependency to it :)

@tacaswell
Copy link
Member Author

rebased and added a test, still not sure if this is the right thing.

On one hand, it fixes a problem (when people pass in more than on distribution and bins as a string it was running the bin generation algorithm only considering the first dataset's limits) but only part way (it now uses the correct limits, but only considers the distribution of the first dataset).

binsgiven = np.iterable(bins) or bin_range is not None
binsgiven = ((np.iterable(bins) and
not isinstance(bins, str)) or
bin_range is not None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth matching the numpy logic here, as:

if bins is None:
    binsgiven = False
else:
    # match the logic in numpy.lib.histograms._get_bin_edges
    if isinstance(bins, basestring):
        binsgiven = False
    elif np.ndim(bins) == 1:
        binsgiven = True
    else:
        binsgiven = False

Renaming to bins_array_given would make sense too.

@eric-wieser
Copy link
Contributor

eric-wieser commented Feb 24, 2019

Would something like this be a better compromise?

try:
    from numpy.lib.histograms import histogram_bin_edges
except ImportError:
    def histogram_bin_edges(arr, bins, range=None, weights=None):
        if isinstance(bins, basestring):
            # rather than backporting the internals, just do the full computation.
            # If this is too slow for users, they can update numpy, or pick a manual number of bins
            return np.histogram(arr, bins, range, weights)[1]
        elif np.ndim(bins) == 0:
            # not strictly the behavior of histogram_bin_edges, but easier to not bother with the linspace here
            return bins
        else:
            return bins

Then call

bins = histogram_bin_edges(arr.ravel(), bins, range, weights)

@tacaswell
Copy link
Member Author

tacaswell commented Feb 25, 2019

bins = histogram_bin_edges(arr.ravel(), bins, range, weights)

Except that we are not guaranteed that the input is a 2D numpy array (or even equal length). Would probably have to do something like hstack to be safe....

Applied the in-line comments though.

@@ -6666,7 +6675,7 @@ def hist(self, x, bins=None, range=None, density=None, weights=None,
# If bins are not specified either explicitly or via range,
# we need to figure out the range required for all datasets,
# and supply that to np.histogram.
if not binsgiven and not input_empty:
if not bins_array_given and not input_empty:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think you now need and bin_range is None here, which makes more sense anyway

@eric-wieser
Copy link
Contributor

Would probably have to do something like hstack to be safe

I think concatenate would be more appropriate here. Was hoping for an approach that wouldn't make extra copies, but I'm not sure if that's a worthwhile or avoidable concern anyway.



def test_hist_auto_bins():
_, bins, _ = plt.hist([[1, 2, 3], [4, 5, 6]], bins='auto')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A test on jagged input might be nice too

@eric-wieser
Copy link
Contributor

PR description is pretty outdated after the title change

@@ -6637,7 +6637,16 @@ def hist(self, x, bins=None, range=None, density=None, weights=None,
bin_range = self.convert_xunits(bin_range)

# Check whether bins or range are given explicitly.
binsgiven = np.iterable(bins) or bin_range is not None
if bins is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole thing is just bins_array_given = np.ndim(bins) == 1? (perhaps with additional comments indicating that this excludes str and None).

@anntzer
Copy link
Contributor

anntzer commented Feb 25, 2019

I agree with concatenating everything and passing down to numpy. I doubt that's even close to being a bottleneck (and anyways users can pass in their own bins if they really care).

if len(xi) > 0:
xmin = min(xmin, np.nanmin(xi))
xmax = max(xmax, np.nanmax(xi))
bin_range = (xmin, xmax)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You either need to keep this code for the case when bins is an integer, or you need to add a case for integral bins in the histogram_bin_edges backport

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, good catch on that, I fell down a different rabbit hole (see my comment at the issue level)...

@tacaswell tacaswell force-pushed the fix_auto_hist_range branch from 1f2de1b to 56c353f Compare March 19, 2019 02:42
# hard-code numpy's default
bins = 10
if range is None:
range = np.nanmin(arr), np.manmax(arr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that numpy does not use the nan variants here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that if we want to support nans, we should pre-remove them ourselves before passing them to histogram_bin_edges.

Copy link
Member

@jklymak jklymak Mar 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... also note the typo here....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just took me about 15 minutes to sort out what the typo was....

@tacaswell
Copy link
Member Author

The test failure is the svg fix from #13710 .

@tacaswell tacaswell closed this Mar 25, 2019
@tacaswell tacaswell reopened this Mar 25, 2019
This change causes the range of all data sets to be computed and
passed to numpy (which in turn uses the total range to compute the
'best' bins).

The existing code 'latches' the bins from the first data set to use
for the rest so this is can still lead to poor binning (if the data
sets are widely different).

closes matplotlib#8636
@tacaswell tacaswell force-pushed the fix_auto_hist_range branch from fbdf09b to a1cad69 Compare April 1, 2019 19:45
as suggested by @eric-wieser

We are no longer tracking if the bins kwarg was passed, but if
it was passed in is an array we should use as the bin edges.

Simplify some internal logic.
@tacaswell tacaswell force-pushed the fix_auto_hist_range branch from a1cad69 to c6f0554 Compare April 1, 2019 22:24
Copy link
Member

@timhoffm timhoffm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subject to CI pass.

@tacaswell
Copy link
Member Author

Squashed this down to 4 commits.

Given the way this is going it is not going to backport cleanly....

@anntzer anntzer merged commit f27cde1 into matplotlib:master Apr 2, 2019
meeseeksmachine pushed a commit to meeseeksmachine/matplotlib that referenced this pull request Apr 2, 2019
tacaswell added a commit that referenced this pull request Apr 2, 2019
…38-on-v3.1.x

Backport PR #8638 on branch v3.1.x (FIX: if bins input to hist is str, treat like no bins)
@tacaswell tacaswell deleted the fix_auto_hist_range branch April 2, 2019 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

plt.hist chooses improper range when using string-based bin options
5 participants