
A reasonable default width for bar() with units? #13236


Open
anntzer opened this issue Jan 20, 2019 · 41 comments
Labels
keep Items to be ignored by the “Stale” Github Action topic: units and array ducktypes

Comments

@anntzer
Contributor

anntzer commented Jan 20, 2019

In #12903 we merged a PR fixing the handling of the default width (0.8) when bar (barh, broken_barh) is called with unitized data (I was among the PR approvers; see also #13187 as followup). Briefly, the idea is to support both unitless and unitized widths, so that one can do

bar([dates])  # default width: 0.8
bar([dates], width=[timedeltas])  # or single timedelta
bar([categoricals])  # default width: 0.8
bar([categoricals], width=[unitless])  # or single scalar

The main trickiness is that the default width is unitless (0.8), but one can also pass a unitful width, so it's not clear how to add it to the unitized x; moreover, note that the unit of x can either correctly handle addition (time + timedelta), or not (categoricals, which need to go through the converter first). So the end design is something like

try:
    # the unit type supports addition (e.g. datetime + timedelta)
    dx = deunitize(orig_x + orig_dx) - deunitize(orig_x)
except ...:
    # addition unsupported (e.g. categoricals): convert dx separately
    dx = deunitize(orig_dx)

where orig_x and orig_dx are the user inputs, either unitized or deunitized.
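The need for the `try`/`except` can be seen with plain Python types, independent of any Matplotlib converter: datetimes support addition with timedeltas, while categorical-style values cannot absorb a numeric offset (a stdlib-only sketch; `deunitize` above stands in for the converter):

```python
import datetime

x = datetime.datetime(2019, 1, 20)
dx = datetime.timedelta(days=1)

# Happy path: the unit type knows how to add the width to the data.
print(x + dx)  # 2019-01-21 00:00:00

# Failure path: categorical-style data cannot absorb a numeric offset,
# so orig_dx must be converted separately from orig_x.
try:
    "apples" + 0.8
except TypeError:
    print("falling back to converting dx on its own")
```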

However, I now feel like this is not really looking at the problem from the right angle: if the user input is unitized, what's the chance that 0.8 "deunitized units" is a reasonable default width? Rather low, I would guess. Instead, it would seem that the intent of the default would/should be "0.8 times the spacing between the bars" (in the common case of evenly spaced bars).

Hence, a solution may be to change the interpretation of a unitless width in the case of unitized data, to have it mean (e.g.) a fraction of the (mean) spacing between the bars (i.e. (max(x) - min(x)) / (n - 1)). Note that because bar did not support unitized data until now, there are no backcompat concerns.
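Concretely, the proposed default could be computed along these lines (a sketch operating on already-deunitized positions; the helper name is made up):

```python
def default_bar_width(x, fraction=0.8):
    """Hypothetical default: a fraction of the mean spacing between bars."""
    xs = sorted(x)
    if len(xs) < 2:
        return fraction  # single bar: nothing to measure, keep the bare 0.8
    mean_spacing = (max(xs) - min(xs)) / (len(xs) - 1)
    return fraction * mean_spacing

# Unit-spaced bars keep the familiar 0.8; wider spacing scales up.
print(default_bar_width([0, 1, 2, 3]))
print(default_bar_width([0, 10, 20, 30]))
```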

Mostly @jklymak I guess. Allowing myself to mark this as release critical mostly so that we don't have to go through a painful deprecation period if we agree on this, but feel free to untag.

@anntzer anntzer added Release critical For bugs that make the library unusable (segfaults, incorrect plots, etc) and major regressions. topic: units and array ducktypes labels Jan 20, 2019
@jklymak
Member

jklymak commented Jan 20, 2019

I think this is reasonable, but a bit mysterious to only apply to unitized data.

Doesn’t obviate the need for the other PRs, but may simplify them somewhat.

@timhoffm
Member

timhoffm commented Jan 20, 2019

Good catch. A fixed "deunitized default" does not make sense. However, the topic is a bit complex.

IMHO the unit-free numeric values corresponding to unitized data are semi-public. For simple cases, people should not have to know about them (i.e. plotting bars based on categoricals or dates). However, we cannot completely hide them, because people might want to draw additional elements on the Axes and thus might need to know about the actual data space of the Axes.

Some general thoughts:

  • Unitized data and bare numbers should not be mixed within a single function call.
  • Interpreting width as relative is difficult. It works for simple cases, in particular for the defaults in which the bars are at subsequent integer positions. However, how do we interpret uneven x-spacing bar(x=[1, 2, 4], height=[1, 2, 3], width=.8)? This is not well defined: Is the width of the second bar 2 or 1.5? Also, it's certainly backward incompatible.
  • Therefore backcompat concerns come in when we want to make width relative everywhere. Alternatively, width would be absolute for plain numbers and relative for unitized data, which is quite a break in the symmetry.
  • A possible way out would be to change the default to None or "auto" and use .8 for bare numbered x. Calculate a 0.8 equivalent default for unitized data. But do not allow an arbitrary float width for unitized data. That would allow a simple default usage without a given width but prevent some interpretation hurdles of a generic "relative width".
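The last bullet could be sketched roughly as follows (all names hypothetical, including the `x_is_unitized` flag, which in Matplotlib itself would come from the converter machinery):

```python
def resolve_width(width, x, x_is_unitized):
    """Hypothetical dispatch for a None/'auto' default bar width.

    `x` is assumed to already be converted to floats.
    """
    if width is None or (isinstance(width, str) and width == "auto"):
        if not x_is_unitized:
            return 0.8  # bare numbers keep the historical default
        # unitized data: a "0.8 equivalent" based on the mean spacing
        xs = sorted(x)
        return 0.8 * (xs[-1] - xs[0]) / (len(xs) - 1)
    if x_is_unitized and isinstance(width, (int, float)):
        raise TypeError("unitized x requires a unitized (or 'auto') width")
    return width
```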

@jklymak
Member

jklymak commented Jan 20, 2019

The paradigm for units is to allow the converter to decide if it likes the non-unitized float and what to do with it.

@anntzer
Contributor Author

anntzer commented Jan 20, 2019

The problem of having "auto" mean 0.8*(the mean spacing) for unitized data but otherwise requiring unitized widths for unitized data is that this does not allow adjusting the bar widths for categorical data, because for those you can really only do the computation after deunitizing (and my guess is that quite a few people want to use width=1 in that case).
Hence the proposal to break the symmetry between unitless data (width is absolute) and unitized data (width is absolute if unitized, relative if not). It's not optimal, but so far no one was using bar() with unitized data (as that did not work), so hopefully users will learn about that (and if they find this too confusing, we can always revisit the design).
I am definitely not proposing to change the interpretation of width for unitless data.

@anntzer
Contributor Author

anntzer commented Jan 21, 2019

See #13254 for an implementation of the proposal.

@dopplershift
Contributor

I see the appeal of using a relative meaning for width, but I'm really not wild about the idea that for this one corner (call it 25% of options) it means relative width, while for the rest of the cases it means absolute. This kind of guessing is what usually gets us into long-term maintenance trouble.

While I get the purity arguments here, do we have a test case that actually shows problematic behavior in the current implementation WRT units and bar width? I get that it might be suboptimal, but is it bone-headedly stupid?

@jklymak
Member

jklymak commented Jan 24, 2019

You can't do datetime + float. It fails right now on 3.0.2 unless you specify the width as a timedelta.

@jklymak
Member

jklymak commented Jan 24, 2019

Btw I agree that if we change to a relative default for some cases it should be relative for all cases.

@anntzer
Contributor Author

anntzer commented Jan 24, 2019

The current default (as of master) is basically a bar of 0.8 days, because datetimes are converted to scalars on a 1-day scale. So you'll get something reasonably nice looking if your x's are ~1 day apart, but not if they're ~1y apart. For scalars, I guess that's typically not a problem because most bar plots effectively use the x-axis as ordinals (and thus consecutive integers)?
I guess the alternative is to document the default as 0.8-of-whatever-your-units-convert-to-unitless, but isn't the point of unit systems that you can't, well, do non-homogeneous operations?
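The scale mismatch is easy to see numerically with a stdlib-only mock of the day-based conversion (Matplotlib's actual converter is `matplotlib.dates.date2num`; the epoch here is illustrative):

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)

def to_days(d):
    # Mimics the day-scale conversion: datetimes become floats counted in days.
    return (d - EPOCH) / datetime.timedelta(days=1)

daily = [to_days(datetime.datetime(2019, 1, d)) for d in (1, 2, 3)]
yearly = [to_days(datetime.datetime(y, 1, 1)) for y in (2017, 2018, 2019)]

print(daily[1] - daily[0])    # 1.0: a 0.8-wide default bar fills most of the gap
print(yearly[1] - yearly[0])  # 365.0: a 0.8-day bar is nearly invisible
```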

@dopplershift
Contributor

Well, unit support enables doing the same operation across values given with different scaling/offsets. What you're proposing is doing a completely different operation by default. IIUC:

  • Unitless: Plot bars that are from [x, x + 0.8]
  • Unit-ed: Plot bars that are from [x, x + 0.8 * average_spacing]

While it could make for better default plotting out of the box, it means some users will be confused by the change in the plot when they switch from plotting on a datetime axis to one on plain floats. I'm not fully against it, but it does make me pause. And it does seem to go against the general grain of matplotlib development lately, where we're moving away from such magical, polymorphic behavior.

@anntzer
Contributor Author

anntzer commented Jan 24, 2019

I guess I just have to accept I fundamentally don't understand what is the general philosophy behind unit support in Matplotlib (we already had this debate some time ago and frankly, every time I try to understand it, I fail... no need to relitigate the whole thing here); in the meantime, just documenting the actual behavior (0.8-of-whatever-your-units-converts-to) seems good enough.

@tacaswell tacaswell added this to the v3.1.0 milestone Feb 4, 2019
@jklymak jklymak removed the Release critical For bugs that make the library unusable (segfaults, incorrect plots, etc) and major regressions. label Feb 10, 2019
@jklymak jklymak modified the milestones: v3.1.0, v3.2.0 Feb 10, 2019
@timhoffm
Member

Are we sure this is not release critical? 3.1 will include the changes for unit-handling from #12903.

I'm not recalling the complete discussion anymore, but would it make sense to require a width argument for unitized data (maybe except categoricals, which are essentially integer units and can do with the global default of 0.8)? That would at least be a clean API and error out with a reasonable error message. We can always add more clever defaults later on.

@timhoffm
Member

timhoffm commented Mar 5, 2019

Ping @anntzer @jklymak @tacaswell concerning my previous comment.

A side effect of #12903 is that unitized x values will generate a working plot in 3.1. However, this uses the default bar width of 0.8, which is generally not suitable. E.g.

import datetime
import matplotlib.pyplot as plt

x = [datetime.datetime(2019, i + 1, 1) for i in range(10)]
y = range(10)
plt.bar(x, y)

will result in extremely small bars:
[figure: the ten bars are so thin they are barely visible]

Is that behavior ok for 3.1, or do we have to do something about it? (A general default seems difficult, so the best option might be to error out for unitized x without a width.)

@jklymak
Member

jklymak commented Mar 5, 2019

I prefer this behaviour to an error. For better or worse, 0.8 is a valid matplotlib datetime according to the converter.

@anntzer
Contributor Author

anntzer commented Mar 5, 2019

Honestly, I don't really care anymore at that point because (per my comment above) I just don't understand what is the coherent plan behind unit support; as long as it doesn't make unitless plotting more difficult I'm fine with everything.

@timhoffm
Member

timhoffm commented Mar 5, 2019

> For better or worse, 0.8 is a valid matplotlib datetime according to the converter.

That's true, but it's not a reasonable default value for plotting datetimes.

Maybe, overall, this is not too critical. We're going from an error to a bad default, which people will likely not rely on. So it's still relatively harmless to change to a better default or raise an error again in future versions.

@jklymak
Member

jklymak commented Mar 5, 2019

I don't think this errored before, did it? What errored was specifying width as a timedelta...

@ImportanceOfBeingErnest
Member

> However, this uses the default bar width of 0.8, which is generally not suitable.

There doesn't seem to be any difference between unsuitable data with and without units. You have the same problem of unsuitable bar widths when plotting plt.bar(range(0, 1000, 100), range(10)). I don't think the main problem is the default width itself: if you get a plot with too-thin or too-thick bars, you'd look up how to change the bar size in the docs and easily find the width argument. Instead, the problem is that the resulting plot may be completely empty, because from mpl 2.0 on bars don't have any edges any more, and shapes below a width of 1 pixel have a chance of not being rendered. Users who get a completely empty plot are much less likely to find the solution to be related to the width argument.

@jklymak
Member

jklymak commented Mar 6, 2019

@ImportanceOfBeingErnest, good point. Not sure what we should do about it or the history behind the style change.

@timhoffm
Member

timhoffm commented Mar 6, 2019

A simple approximate solution would be to check dx / width. If that's larger than a certain factor, say 200, issue a warning.
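That heuristic could look something like this (a sketch; the function name is made up, and the threshold of 200 is the factor suggested above):

```python
import warnings

def warn_if_bars_too_thin(x, width, factor=200):
    """Hypothetical check: warn when bars may rasterize below one pixel."""
    span = max(x) - min(x)
    if width > 0 and span / width > factor:
        warnings.warn(
            f"bar width {width} is tiny relative to the data range {span}; "
            "the bars may not be visible", stacklevel=2)

warn_if_bars_too_thin([3854000.0, 3858500.0], 0.8)  # emits the warning
```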

@kylrth

kylrth commented May 22, 2019

Can we do something about this? I just ran the following:

from matplotlib import pyplot as plt

x = [3858500.0, 3858000.0, 3857500.0, 3857000.0, 3856500.0, 3856000.0, 3855500.0, 3855000.0, 3854500.0, 3854000.0]
y = [105.0, 205.0, 130.0, 190.0, 145.0, 155.0, 125.0, 160.0, 115.0, 170.0]
plt.barh(x, y)
plt.show()

and got what I thought was a blank plot; I spent the next 25 minutes trying to figure out what I did wrong. In my view, the sensible default would be 0.8 × the distance between each bar, even if the bars are different distances apart. Then if the user thinks it's ugly they can set the bar width (or height in the case of barh) explicitly.

@tacaswell
Member

Maybe the minimal thing we should do here is to set the edge color on bar and barh? The way we are failing here is that the bars are narrow enough that, when rasterized, they round to invisible. By default we draw patches without a border (because otherwise they take up slightly more space than they should, since we stroke the edge with the brush centered on the edge). The thickness of the edge is sized in physical units, so even extremely narrow bars still show up; e.g.

from matplotlib import pyplot as plt

x = [3858500.0, 3858000.0, 3857500.0, 3857000.0, 3856500.0, 3856000.0, 3855500.0, 3855000.0, 3854500.0, 3854000.0]
y = [105.0, 205.0, 130.0, 190.0, 145.0, 155.0, 125.0, 160.0, 115.0, 170.0]
plt.barh(x, y, edgecolor='C0')
plt.show()

has all of the bars visible.

@tacaswell
Member

diff --git a/lib/matplotlib/axes/_axes.py b/lib/matplotlib/axes/_axes.py
index 40e48fd55..7354ba152 100644
--- a/lib/matplotlib/axes/_axes.py
+++ b/lib/matplotlib/axes/_axes.py
@@ -2275,7 +2275,7 @@ class Axes(_AxesBase):
         color = kwargs.pop('color', None)
         if color is None:
             color = self._get_patches_for_fill.get_next_color()
-        edgecolor = kwargs.pop('edgecolor', None)
+        edgecolor = kwargs.pop('edgecolor', color)
         linewidth = kwargs.pop('linewidth', None)
 
         # Because xerr and yerr will be passed to errorbar, most dimension

is enough to make that change, but it changes a lot of test images.

@timhoffm
Member

timhoffm commented Sep 1, 2019

Changing the style just to work around this issue feels wrong. It would be sort of ok to enable edges globally, but not as a workaround in a single function. Moreover, the result is most likely still not what the user wants. So this workaround would only be an indirect way of hinting what is going wrong.

In descending priority a fix should be:

  • Ideally, give reasonable default widths (I know we've been discussing this a bit and I'm not sure there is a solution).
  • Can we detect that an Artist is rendered to 0 pixels? If so, we should warn about that.
  • If we cannot determine this exactly, we can still approximate a dx/width threshold and warn if we assume the width would be too small.

@tacaswell
Member

I think trying to detect zero-width draws is going to be both very expensive computationally and very hard to actually get right. For vector outputs we just output the points to draw and are at the mercy of the final renderer to determine how many pixels end up on the screen. Because the user can change the dpi / axes size / figure size / axes limits, how much of data space is covered by a pixel can change wildly between when we create the Rectangle artists and when the image is actually saved. I cannot see what the "right time" to do this would be.

Maybe we could warn from inside our Agg wrappers?

@anntzer
Contributor Author

anntzer commented Sep 2, 2019

For raster output this can probably be done without too much difficulty within the path snapper. Or we can just not snap paths in that case...

@ImportanceOfBeingErnest
Member

Something to consider is that there are presumably lots of cases where you do want a patch not to be drawn if it is less than a pixel, and not get a warning. E.g. a map with the world's countries as patches shouldn't give a warning by default, even though it contains the Principality of Monaco.

@anntzer
Contributor Author

anntzer commented Sep 2, 2019

I would suggest only doing so when the snapping causes the path to exactly come back on itself (while the path was not coming back over itself in the absence of snapping).

@jklymak
Member

jklymak commented Sep 2, 2019

Personally, I think the way forward is a width="auto" that tries to do the best it can, plus an rcParam to set the default width.

@snowzky

snowzky commented Sep 14, 2019

I would appreciate an easy way to get the unit used by width, after a plot has been made, when handling time series data.
Something like .get_width_units() applied to the bar plot - or whatever, as long as it is easily usable.
Or if it is always 'days' when using time data, then at least have this stated in the documentation.

I had this issue yesterday, when plotting data with timestamps 10 minutes apart. Matplotlib's bar evidently sees the unit as days, something I had to find out by trial and error, guessing my way to a correct width.
If someone did not know their data well enough, they might not even catch the error of overlapping bars (since no edge is drawn by default) - or conversely, really small bars as exemplified above.
The confusion, especially for an inexperienced user, was made worse by the fact that the same data plotted with pandas has the width unit set to the more reasonable 10 minutes.

@jklymak
Member

jklymak commented Sep 14, 2019

You should be able to set the width with a timedelta or Timedelta64. It's not trivial to explain how bar widths interact with all the unit types. If you have a suggestion for where it would have been helpful to see this in the docs, that would be very welcome.
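For the 10-minute case described above, that would look roughly like this (stdlib arithmetic only; the commented-out `plt.bar` call assumes the unit handling merged in #12903):

```python
import datetime

start = datetime.datetime(2019, 9, 14, 12, 0)
x = [start + datetime.timedelta(minutes=10 * i) for i in range(6)]
y = [1, 3, 2, 5, 4, 2]

# 80% of the 10-minute spacing, expressed as a timedelta...
width = datetime.timedelta(minutes=8)
# ...which, on the day scale the date converter uses, is this bare float:
width_in_days = width / datetime.timedelta(days=1)
print(width_in_days)  # 0.00555... -- why a bare default of 0.8 is far too wide here

# plt.bar(x, y, width=width)  # with matplotlib.pyplot imported as plt
```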

@ImportanceOfBeingErnest
Member

An advantage of the currently implemented way is that it's totally deterministic. Units are always data units. And width is 0.8 data units, unless changed by the user. The same is actually true for pandas (though it might be 0.6 instead of 0.8), but the confusing bit with pandas is that barplot units are categorical: each bar is 1 unit away from the next, independently of what the formatter says. In that sense pandas plots might result in completely confusing output when an unequally spaced timeseries is plotted, such that bars are equal in width in display space, but look unequal in width in data space because the formatter suggests different axis units.

@jklymak
Member

jklymak commented Sep 14, 2019

In master, after #12903 the current default is a bare 0.8 float with no unit information, not 0.8 in "data units" (I don't think it was ever 0.8 in "data units").

If x0 has units, and x = convert(x0) is the float representation, and dx0 is the width, then we try converting dx0 using dx = convert(x0 + dx0) - x before just doing convert(dx0). We do this to allow unit-aware dx0, and to allow units that have __add__ defined between them (like datetime and timedelta in particular) to work as widths.

If anyone wants to have a unitized bar that does something else, i.e. x0 = myUnitPacker(5, 'gallon') and dx0 = myUnitPacker(2, 'liter') and they have defined __add__ between these two units, then this width will now work. With the current default on 0.8, it would be up to the unit converter to decide what 0.8 means. Some converters pass through unitless data, some do not.
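The try-then-fall-back order described here can be sketched with a toy converter (the real logic lives in Matplotlib's bar implementation; the names here are illustrative):

```python
import datetime

def convert_dx(dx0, x0, convert):
    # Try unit-aware addition first (e.g. datetime + timedelta),
    # fall back to converting dx0 directly (e.g. a bare 0.8).
    try:
        return convert(x0 + dx0) - convert(x0)
    except TypeError:
        return convert(dx0)

EPOCH = datetime.datetime(1970, 1, 1)

def to_days(v):
    # Toy converter: datetimes -> days since epoch; bare floats pass through.
    if isinstance(v, datetime.datetime):
        return (v - EPOCH) / datetime.timedelta(days=1)
    return float(v)

x0 = datetime.datetime(2019, 1, 20)
print(convert_dx(datetime.timedelta(hours=12), x0, to_days))  # 0.5
print(convert_dx(0.8, x0, to_days))                           # 0.8
```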

@ImportanceOfBeingErnest
Member

> not 0.8 in "data units" (I don't think it was ever 0.8 in "data units").

We call the units of the axis scale "data" throughout, e.g. ax.transData.

@jklymak
Member

jklymak commented Sep 14, 2019

That is confusing though, because an axis has "units" that get converted by the axis' unit converter to floats. I believe what you are referring to are data co-ordinates.

@ImportanceOfBeingErnest
Member

I'm no native speaker, but I believe you would call the distance between 2 and 1 on an axis "1 (data) unit" not "1 (data) coordinate"?

@jklymak
Member

jklymak commented Sep 15, 2019

Just to clarify: the problem with our units handling, as pointed out by @snowzky, is that at the moment the user has no way to know what "0.8" means relative to the units they are plotting in. I.e. imagine units were lengths, and they passed in bar(myUnits([0.4, 1.2, 0.7], 'meters'), [12, 11, 19]); they have no way of knowing what the unit converter does to the default 0.8. Is 0.8 translated assuming it is 'meters'? Or maybe kilometres (in which case the default bars will be far too thick), or maybe micrometers (in which case the default bars are much too thin)? In the case of datetime objects, 0.8 days makes sense, but so would 0.8 seconds. For some units, it is possible that convert(0.8) will even error out, and bar will fail with the default width.

I'm not sure what the way forward is. Making the width the median distance between x values is tempting, but it is also impossible to report that width (with appropriate units) back to the user because we lose that information immediately upon conversion. We have been talking about a new data framework where the raw x and dx are retained as well as the converted versions. That would be a big change, but something like that is on the roadmap for 4.x

@timhoffm
Member

I'm tempted to go with a new default 'auto', meaning 0.8 times the median bar spacing.

  • It's backward compatible with the current behavior of
    • non-unitized equidistant data
    • labelled data
  • It gives better results for other unitized equidistant data.
  • It does not affect users who have explicitly set the width to work around the current bad behavior.

I assume that all use-cases that use the current default and that would be affected by the proposed change are more or less heavily broken with too small or broad bars. Breaking them is a penalty we may consider bearable.
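A sketch of that 'auto' default, using the median spacing discussed earlier in the thread (names hypothetical; the 0.8 factor matches the historical default):

```python
import statistics

def auto_width(x, fraction=0.8):
    """Hypothetical 'auto' width: a fraction of the median spacing between bars."""
    xs = sorted(x)
    if len(xs) < 2:
        return fraction
    spacings = [b - a for a, b in zip(xs, xs[1:])]
    return fraction * statistics.median(spacings)

print(auto_width([1, 2, 3, 4]))      # matches the current 0.8 for unit spacing
print(auto_width([0, 10, 20, 100]))  # robust to the one outlier gap
```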

@github-actions

This issue has been marked "inactive" because it has been 365 days since the last comment. If this issue is still present in recent Matplotlib releases, or the feature request is still wanted, please leave a comment and this label will be removed. If there are no updates in another 30 days, this issue will be automatically closed, but you are free to re-open or create a new issue if needed. We value issue reports, and this procedure is meant to help us resurface and prioritize issues that have not been addressed yet, not make them disappear. Thanks for your help!

@github-actions github-actions bot added the status: inactive Marked by the “Stale” Github Action label May 30, 2023
@kylrth

kylrth commented May 30, 2023

Thanks to the ping from the bot, I tested my example from above today with version 3.7.1:

from matplotlib import pyplot as plt

x = [3858500.0, 3858000.0, 3857500.0, 3857000.0, 3856500.0, 3856000.0, 3855500.0, 3855000.0, 3854500.0, 3854000.0]
y = [105.0, 205.0, 130.0, 190.0, 145.0, 155.0, 125.0, 160.0, 115.0, 170.0]
plt.barh(x, y)
plt.show()

[figure: very thin bars, now visible]

Are we satisfied with this? The bars are visible now, which means people like me who are confused why their plot is blank will have a better idea of what's happening. But obviously it's an unusable default.

@jklymak jklymak added keep Items to be ignored by the “Stale” Github Action and removed status: inactive Marked by the “Stale” Github Action labels May 30, 2023
@jklymak
Member

jklymak commented May 30, 2023

I think @timhoffm's 'auto' proposal is a good one. Someone just needs to implement it in a reasonable way.
