Skip to content

ENH: Add grouped_bar() method #28560

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Conversation

timhoffm
Copy link
Member

@timhoffm timhoffm commented Jul 13, 2024

This is a WIP to implement #24313. It will be updated incrementally. As a first step, I've designed the data and label input API.

Please check the included example, which explains the API variants and motivations/consistency considerations. It's a WIP for the illustration and discussion of the API, and will not go into the final PR

I've flagged as "work in progress" because many aspects are not implemented yet. **But please give comments as I consider the implemented parts as complete proposal for that aspects.**

The API proposal is complete. Please check whether it is reasonable. You can get an overview the preliminary overview example and in the API docs for grouped_bar. Design decisions and background are listed in right below in the todos:

👉 Ready for review

Todos:

  • API: How to structure and interpret the two data dimensions and the associated labels.

  • For now, only categorical labels are supported as x. Should we also support numbers as with bar?
    Numbers are now also supported.

  • How to parametrize bar width and group spacing?
    A bar group takes horizontal space w (default: 1).

    I believe explicit bar widths are not helpful because they can easily overflow if enough bars are in a group, and "reasonable" values need fine-tuning for per number of bars in a group. I therefore suggest to define the spacing (margin / padding; name t.b.d.) in relative units. Options:

    • padding in units of w: e.g. pad=0.1 would mean a bar group targeting the interval (x-0.5*w, x+0.5*w) would have the outer bar edges at (x-(0.5-pad*w), x+(0.5-pad*w) = (x-0.4w, x+0.4w)`.
    • padding as multiples of the bar width: e.g. a pad=1 would mean that the bar width is chosen so that there's a padding of 1-bar width on both sides of the group.

    I believe, we can choose reasonable defaults for both cases. It's possibly easier to tune a desired look with the second approach, but harder if one wants to fully control positions.

    Decision: Implemented as multiples of bar width

  • Do we want to support spacing between the bars in a group?
    Yes. Spacing should follow the same relative conventions as group padding.
    Since we have num_bars times the spacing "spacing in units of w will require tuning depending on the number of bars.
    Spacing would be universal when given as "fraction of bar width".
    When defaulting to 0 (my slight preference) both variants will work for any number of bars by default.

  • How to support horizontal bars?
    Use orientation parameter. Replicating to a separate function grouped_barh is overly cumbersome.

  • Do we want to support stacked bars? And grouped stacked bars? - Not for now/here.
    While stacked bars could use the same input datastructures, other plotting parameters like width and spacing settings do not match, and some bar() flexibility like varying width will not be available via grouped_bar(). I thus recommend not to try and squeeze stacked bars in to grouped_bar(). Stacking bars is already much simpler with the existing bar() methods compared to grouping bars. bar() is thus more sufficient. If we want a higher level interface, a separate stacked_bar() would be a thin wrapper around bar(). I don't see a reasonable function for stacked and grouped bars, because the addtional dimension of data and labelling would make the interface very difficult. If you need stacked and grouped bars, you should use a loop of grouped_bar() with adjusted bottom.

  • support kwargs (Rectangle properties)

  • support vectorized style parameters - at least colors

  • optional: support bottom values, so that one could stack (but could be added later)

  • dataset labels naming? label as in bar(), or labels or dataset_labels or …
    Note: stackplot() uses labels. This would be an argument for consistency. What's different:
    We potentially have two labels: dataset labels and group labels. This may warrant deviating from the simple labels
    apporach to disambiguate them. OTOH, I feel "dataset" is quite a generic name (but I don't have any better) so that
    it does not make things much clearer. The group labels are implicit in x (or historically in bar(), called tick_labels.
    I'm going with labels because its

    • consistent with stackplot() and bar() (which uses label)
    • consistent with legend labels always being specified through label
    • a longer name does not significantly help with disambiguation
  • add pyplot wrapper

  • add mypy stubs

  • improve documentation

  • add tests

  • Should we mark this as provisional? This would allow to react to user feedback. -> Yes.

  • What is the return type? List of BarContainer, a new BarGroupContainer, something else, nothing for a start (because we can't decide yet)?
    Going with list[BarContainer}. This is most simple and in analogy to what stackplot() does. Since the API is provisional, we can still reconsider.

@story645
Copy link
Member

For now, only categorical labels are supported as x. Should we also support numbers as with bar?

Am thinking of a bunch of situations (date times, coded data, machine learning labels) where the numbers are really categoricals but the dtype isn't string and I think folks will expect it to just work. Also the categorical remapping mechanism isn't terribly robust 😓 so I could see folks wanting to just use numbers + formatter if they're trying to do something like animated/interactive sorting of bars (something like bar chart race) where the position of the bar is actually independent of its label as categorical will maintain the same cat->int mapping as stuff is added to the chart. Which tldr, yes numbers b/c that allows for a lot of fine tuning of bar position.

@timhoffm
Copy link
Member Author

Added numeric values support for x. I've required that x ist equidistant, becuause otherwise positioning will be quite messy.

if they're trying to do something like animated/interactive sorting of bars (something like bar chart race) where the position of the bar is actually independent of its label as categorical will maintain the same cat->int mapping as stuff is added to the chart.

I've not thought this case though, but remember that grouped_bar() is just a convenience interface for positioning multiple sets of data. One boundary condition is that each group (=bars at one category/tick position) has the same datasets in the same order. If you need something exotic, you'll have to go back to drawing individual bars.

@story645
Copy link
Member

story645 commented Jul 15, 2024

If you need something exotic, you'll have to go back to drawing individual bars

I don't think this is technically feasible, but this has me thinking it'd be nice if the return object was a 2D bar container that supported slicing out all the segments at a given X or all the segments of a group for easier consistent updates.

@timhoffm timhoffm force-pushed the grouped_bar branch 3 times, most recently from 93c7651 to 5bab18d Compare September 11, 2024 11:42
@github-actions github-actions bot added the Documentation: API files in lib/ and doc/api label Sep 11, 2024
@timhoffm
Copy link
Member Author

timhoffm commented Sep 11, 2024

The API proposal is complete. Please check whether it is reasonable. You can get an overview the preliminary overview example
and in the API docs for grouped_bar. Design decisions and background are listed in the first comment of this PR. Feedback is welcome!

@tacaswell tacaswell added this to the v3.10.0 milestone Sep 13, 2024
@tacaswell
Copy link
Member

A couple of quick comments:

  • thoughts on how this could be extended if (when) we implement hierarchical tick labels?
  • for the return did you consider a dict? It might be worth considering a minimal custom class (__getitem__ by the labels and obj.remove()) as well? The two things I think are important are the by-label access (particularly when user passed us a dict to begin with) and the ability to remove the whole lot in one shot. Do we want to return a new compound artist (that legit lives in the tree)?
  • is there a way to control styling by label (e.g. put a different hatch on each group)? I'm not sure this is useful, but also not sure it is not useful. "grab the returned object and do it your self" is also a good answer for now.
  • the first example in the docs should be turned into an figure comparison test to get at least some coverage?

@timhoffm
Copy link
Member Author

timhoffm commented Sep 13, 2024

thoughts on how this could be extended if (when) we implement hierarchical tick labels?

Not thought about it. But the fundamental API grouped_bar(x, heights) should be able to bear this. Likely with a hierarchial x and with or without substructuring in heights.

Side note: There is one aspect, I've been pondering: We could revese the positional argument order to groupde_bar(heights, x). We have kinda precendence for that in stairs. This would allow to (i) leave out x entirely for a quick look (use range() as default) and (ii) more importantly make it possible to pass a complex object grouped_bar(dataframe), which contains all the labelling information; this could then also be hierarchial. I've refrained from it to keep more analogy to bar. After all, our API is mostly low-level component-based. I believe to change that, it would be better to add a data preprocessing decorator that takes complex objects and decomposes them into the data components.

for the return did you consider a dict? It might be worth considering a minimal custom class (__getitem__ by the labels and obj.remove()) as well? The two things I think are important are the by-label access (particularly when user passed us a dict to begin with) and the ability to remove the whole lot in one shot. Do we want to return a new compound artist (that legit lives in the tree)?

We have no precedence that plotting functions return dicts. Additionally, one can plot without passing labels, which would us require to invent keys. So dict is not an option to me. Mid-term, I'd like to have better return objects, but his is a larger topic, which I don't want to figure out in this PR. I see the list of Container as a temporary approach - one of the main reasons to make this provisonal. It is certainly not the most elegant approach, but it's consistent (c.f. stackplot()), simple and gives users a grip on the underlying artists. Given that re-using the Artists is quite rare, that's good enough for now. Mid-term, I'd like to have better return objects. I will add an explicit note that this is likely to change.

is there a way to control styling by label (e.g. put a different hatch on each group)? I'm not sure this is useful, but also not sure it is not useful. "grab the returned object and do it your self" is also a good answer for now.

No. I've never seen this. Generally, groups are distinguished by location (and identified by x tick labels), and there's a point in that the bars of one dataset look the same over all groups to help make the connection. Per-group settings is not something I would want to encourage through API (though we could later introduce a parameter dict group_props if you convince me that's really a valid use case 😃). For now: Do it yourself.

the first example in the docs should be turned into an figure comparison test to get at least some coverage?

Yes, as the todo in the intial PR shows, tests are still open. I'm collecting feedback before so that I don't have to rewrite all tests in case I would have to overhaul the design.

@tacaswell
Copy link
Member

We could revese the positional argument order to groupde_bar(heights, x). We have kinda precendence for that in stairs.

I agree that is interesting, but as you point out could be solved by layers on top of this.

Additionally, one can plot without passing labels, which would us require to invent keys.

We could fallback to range() or [f"dataset_{i}" for in range()] but I agree that is less than great. That would probably mean we would also force those invented labels into the legend which is not great. I'm convinced we should not return a dict (but a bit sad about it).

Rather than a list could we do a custom object like:

class GroupedBarReturn:
    def __init__(self, bars):
        self.bars = bars
    def remove(self):
        [b.remove() for b in self.bars]

? This would also give us a reasonable expansion path if this API grew to add errorbars or annotations to the bars. Also a path to obj.by_labels[key]

Even though it is provisional, I'm a bit worried about returning an object with an API that would be annoying to replicate (e.g. not fall into the trap of ax.errorbar).

This may be a bigger discussion, but I think we should try to make everything we return from plotting methods to have a .remove method so "make it go away" is always the same.

That said, I'm not going to push this further and returning a list is acceptable to merge.

No. I've never seen this....For now: Do it yourself.

sounds reasonable.

@timhoffm
Copy link
Member Author

timhoffm commented Sep 13, 2024

Edit: I think I've reasonbly addressed the doc build failure.

Technical topic: I don't understand the doc failure:

/home/circleci/project/doc/api/axes_api.rst:294:<autosummary>:1: WARNING: py:obj reference target not found: Axes.dataLim [ref.obj]
/home/circleci/project/doc/api/axes_api.rst:449:<autosummary>:1: WARNING: py:obj reference target not found: AxesBase [ref.obj]

The call stacks goes through the .. autosummary:: sections axes_api.rst ending with the stated line numbers. They try to autogenerate the doc pages for update_datalim and add_child_axes but fail to resolve the following object references:

Extend the `~.Axes.dataLim` Bbox to include the given points.

and
Add an `.AxesBase` to the Axes' children; return the child Axes.

What's funny is that this PR does not touch the respective docstrings or targeted objects. Also, the only modification of axes_api.rst itself is the addition of grouped_bar (in another autosummary).

It seems that the failure is not related to the PR but some external influence (change of sphinx?). The failure may even be expected, because both references use AxesBase, which is not in our documentation, and the stable/devedocs render them as unresolved references. But why does it happen here and now? Other PRs do not see the problem. Do we have caching of the doc builds, and I'm the first to touch axes_api.rst to trigger regeneration?

@story645
Copy link
Member

more importantly make it possible to pass a complex object grouped_bar(dataframe), which contains all the labelling information

Also would make it trivial to transpose, which honestly is one of the best parts of the pandas API.

Which also, I think my request for an BarContainerArray and Tom's for a dict based return are both pointing towards a data frame type artist container - which yes I can see being out of scope here and also I think you proposed something like this once for Axes so you could do both mosaic label based and index based

timhoffm added a commit to timhoffm/matplotlib that referenced this pull request Sep 16, 2024
These were not rendered as links in the current docs, because the
associated code objects do not exist as targets in the docs. For some
reason, sphinx did not complain about it. But it does in some recent
PRs (extracted from matplotlib#28560) - I'm unclear why, but anyway the correct
solution is to change the references:

- `AxesBase` is not documented itself and in the given context, we don't
  have to make the specific distinction, so just link `Axes`
- The datalim attribute is documented as part of the Axes class, but
  does not have a dedicated target to link to. So we simply use a
  literal instead of a link.
@timhoffm
Copy link
Member Author

Changed to a provisional custom return object as proposed in #28560 (comment) to ease possible future migration.

timhoffm added a commit to timhoffm/matplotlib that referenced this pull request Sep 17, 2024
These were not rendered as links in the current docs, because the
associated code objects do not exist as targets in the docs. For some
reason, sphinx did not complain about it. But it does in some recent
PRs (extracted from matplotlib#28560) - I'm unclear why, but anyway the correct
solution is to change the references:

- `AxesBase` is not documented itself and in the given context, we don't
  have to make the specific distinction, so just link `Axes`
- The datalim attribute is documented as part of the Axes class, but
  does not have a dedicated target to link to. So we simply use a
  literal instead of a link.
@timhoffm timhoffm marked this pull request as draft January 22, 2025 18:49
@timhoffm timhoffm force-pushed the grouped_bar branch 4 times, most recently from dcd253e to 3825074 Compare January 23, 2025 14:05
@timhoffm timhoffm marked this pull request as ready for review January 23, 2025 16:43
@timhoffm
Copy link
Member Author

timhoffm commented Jan 23, 2025

From my perspective, this is ready for review.

@timhoffm timhoffm force-pushed the grouped_bar branch 2 times, most recently from 7cf1e6b to 5e82fbb Compare January 23, 2025 23:55
@timhoffm
Copy link
Member Author

@story645 A big thank you for a though round of review 🎉.

I've either changed the commented positions or replied why I think they should be kept as is. Please mark the comments as resolved on which you are ok with my reply so that only the open topics remain visible.

@timhoffm timhoffm force-pushed the grouped_bar branch 2 times, most recently from e023e0e to 162ce79 Compare January 24, 2025 11:18
Co-authored-by: hannah <story645@gmail.com>
Copy link
Member

@story645 story645 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take or leave the table tweak

Co-authored-by: hannah <story645@gmail.com>
@timhoffm timhoffm added New feature and removed topic: pyplot API Documentation: examples files in galleries/examples Documentation: API files in lib/ and doc/api labels Feb 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants