Skip to content

feat: StepPatch #18275

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Sep 17, 2020
Merged

feat: StepPatch #18275

merged 12 commits into from
Sep 17, 2020

Conversation

andrzejnovak
Copy link
Contributor

@andrzejnovak andrzejnovak commented Aug 17, 2020

PR Summary

A new StepPatch artists based of PathPatch facilitation plotting histogram-like data and an associated pyplot.stairs() plotting function.

Implementation based on @timhoffm and @jklymak suggestions, after drafts and discussions in #15065 and #15019 and the discussion below.

Closes #6669 and possibly #5855.

y, x = np.histogram(np.random.normal(5, 2, 1000), bins=np.linspace(0, 10, 40))
y = y.astype(float)
y[10:12] = np.nan
y[17] = np.nan
y[-2:] = np.nan

f, ax = plt.subplots()
ax.stairs(y+3, baseline=2, label="X", fill=True, ls='-')
ax.stairs(y*2, baseline=None, label="X", fill=False, ls='--', lw=2)

image

It is also considerably faster than ax.hist for a filled plot, since it draws only one* patch.

image

PR Checklist

  • Has Pytest style unit tests
  • Code is Flake 8 compliant
  • New features are documented, with examples if plot related
  • Documentation is sphinx and numpydoc compliant
  • Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
  • Documented in doc/api/next_api_changes/* if API changed in a backward-incompatible way

@andrzejnovak
Copy link
Contributor Author

Would be real cool if the crowd vocal at the previous iterations @jklymak @timhoffm @tacaswell @ImportanceOfBeingErnest could leave a comment if they're happy with this take before I sink time into tests and docs.

@jklymak
Copy link
Member

jklymak commented Aug 19, 2020

I haven't looked in super detail but this seems like a cleaner approach to me.

@timhoffm
Copy link
Member

timhoffm commented Aug 19, 2020

This definitively goes in the right direction. However, there are a few design decisions to be considered:

  • How specific do we want to go with the Artist class:

    • Polygon: pro: already there. con: we do not have concepts for e.g. baseline, so that it cannot be easily retrieved or altered.
    • StepLine or StepPolygon: structural description
    • HistLine: semantic description
      I tend to go with StepPolygon. HistLine seems unnecessarily specific. One might want to use this type of plot without being based on histogram data, e.g. for a discrete CDF.
  • Similar (but not necessarily identical) question: How specific do we want to go with the generating Axes method? stepline() vs. histline(). Note that the method should create and return respective Artist class (the current implementation just creates Polygons - but I assume that's work in progress).

  • I'm inclined to use the signature stepline/histline(edges, values, ...) and not stepline/histline(vals, bins=None, ...). Technically, you specify the bin edges and not the bins. It's also not directly comparable to hist(x, bins=None), because one of the major use cases is (semi)-automatically determining the edges with bins=None or bins=N. Specifically, I have doubts that a default of bins = arange(len(vals)) is reasonable.

@andrzejnovak
Copy link
Contributor Author

andrzejnovak commented Aug 19, 2020

@timhoffm props, this is exactly the type of discussion I was hoping could happen before more time is put into writing code that might not get used...

I think a new class, based on Polygon was a nudge in the right direction. It natively works for both filled/step histogram-like plot (further only referred to as histogram). The new class whatever it's called (let's say HistLine for now, but I don't care) is a restricted version of Polygon, where the "data" is entirely determined by values and edges and the mapping between these values and how it is displayed is unambiguously done by the Artist. Histograms should essentially be discretized pdfs and I don't see representing a cdf as one problematic at all. I don't have any strong feelings about what this class should be named. One more thing in favour of a new class is relatively easy addition of a custom legend handler, which can look different based on whether fill is true or false, without disturbing existing behaviours.

Re 2) The reason that histline returns multiple HistLine's (a.k.a modified Polygons) is Nan handling. I think the current scheme where Nan values in vals are handled, but Nan values in bins would throw an error makes sense in the targeted context (and will be made explicit). If there would be a way to set some edges of the polygon/histline "invisible" we could return just the artist, but as far as I could tell neither Patch nor Polygon have anything close to that kind of functionality. It could also very well result in some conflicts with the filled option. Returning a collection or a list of artists is also consistent behaviour with the likes of hist of fill_between.

Re 3) Having a default bins = np.arange(len(vals)+1) costs nothing and will save time in real use case scenarios - either exploratory or in many cases in HEP we have histogram-like plots representing cutflows or some arbitrary class assignment, where the bin edges are irrelevant (yes they should be a bar plot, but if you have 200 quantities to iterate over and you don't know apriori which will be which, it's a hassle). If you accept this premise, then the ordering makes sense as well. I think both edges and bins are semantically reasonable options. I would give a slight edge (ha) to bins because it would align with hist template and while I don't know if this is universal, I either don't pass anything to bins in hist or I pass a np.linspace which is again technically edges. In fact if you wanted to be pedantic about it a bins array should look like [[0,1], [1,2], [2,3]...], but that's just not practical.

@jklymak
Copy link
Member

jklymak commented Aug 19, 2020

I think one of the points of such an artist is to be able to set the x-data and the y-data (or bins and vals) after it is created and have the right thing happen. This is important for animations. So I don't think that returning multiple patches is a good design decision because you can't do that. It might be fine under the hood to have multiple patches, but ideally the user would get one object they could manipulate.

@andrzejnovak
Copy link
Contributor Author

The current nan handling could be pushed down into the HistLine artist, which could internally be a list of one or more polygons masquerading as one artist for get/set purposes. I don't know if there's another way to implement it and it's definitely a reasonable thing to have.

Btw I think StepPolygon is a good name for this new class.

@andrzejnovak
Copy link
Contributor Author

Alright, so I am not sure I particularly like this solution because it essentially maps a "collection" onto one artist, which seems weird. But in this way, the StepPolyCollection can have the behaviour requested by @jklymak for setting/getting and at the same time handle Nan values. I moved to inheriting from PolyCollection because I didn't see a way to build this class on Patch or Polygon because the artist has to be discontinuous to reasonably handle nan values. The drawback is that some things were more accessible when inheriting from Polygon - for example having the "fill" kwarg or get_linestyle() returning the usual shorthand like "--" whereas the same function on the collection returns the processed [A, (B, C)] format. It was actually interesting for me to learn that the Collections classes are not related to their respective baselines i.e. PolyCollections has nothing to do with Polygon

By the looks of it, this approach can yield the same result, but the code might get uglier.

@timhoffm
Copy link
Member

It's reasonable to build the class on top of PathPatch. Paths and their Artist container PathPatch can consist of multiple segments. See e.g. https://matplotlib.org/gallery/shapes_and_collections/compound_path.html#sphx-glr-gallery-shapes-and-collections-compound-path-py

Collections are more made for similar objects, e.g. a set of squares with, possibly different properties like color but maybe also size.

I'm coming back on issue 3) (edges vs bins) later.

@jklymak
Copy link
Member

jklymak commented Aug 20, 2020

I'll stick my oar in on edges versus bins. bins is ambiguous compared to edges and admits the bar-like interpretation of "bin 1, bin 2, ...bin n" where we want to make it clear its the edges that need to be specified.

I'm torn about whether edges should be a kwarg or not. I guess it is OK that its second because np.histogram returns values, bins_edges in that order, even if it feels a little funny to me to have the independent variable come before the dependent.

I'm also not sure about your choice of default. I guess I'd expect, by default, the bins would be centred on an integer (i.e. +/-0.5 of what you chose).

@timhoffm
Copy link
Member

Parameter naming

I agree with @jklymak in that I favor edges. Additional arguments:

  • bins is semantically tied to histograms, which is a more narrow interpretation of a step line than what I'd like this function and Patch class to be. For the same reason I don't count aligning with the hist() signature as an advantage.
  • The main usage for bins in hist() is bins=N. The extension of that parameter to a sequence of edges is semantically not exact, but a reasonable choice for an extension of the API. The imprecision in naming is the smaller evil there compared to having mutually exclusive bins and bin_edges arguments). For our new function, we don't have that burden and can directly use the precise name edges.

Default edges

I'm not clear what a good default would be np.arange(len(vals)+1) and np.arange(len(vals)+1) + 0.5 both have their justification. Partly that's the reason I'd leave out a default by now. Note that we can always add a default later.

Parameter order

+/-0

My interpretation of a step line is more that of a piece-wise constant function, which favors (edges, values). Note that the order is not necessarily dictated by defaults. We also support plot(x, y) and plot(y), allowing for a default x. This is a bit off the standard python default mechanism, but with plot() we have prominent precedence and it can reasonable if non-default is the dominating use case.

OTOH stepline(*np.histogram(x)) is a compelling argument for the other order.

@andrzejnovak
Copy link
Contributor Author

andrzejnovak commented Aug 20, 2020

I agree with you that edges is a more correct term, but I think it's more valuable to keep consistent with to choices made in np.histogram and plt.hist. Ultimately the array you would pass here is the same array you would pass those functions, where it would be called bins. I understand the fcn/artist could be used more generally, but I think it's reasonable to say that histograms are going to be responsible for the bulk of its calls.

Regarding the param order and presence of a default, you mention the 2 points that I was going to. I just want to add that I am still disappointed at times when I forget that scatter doesn't behave like plot, and the fact that plot(y) is acceptable syntax is awesome convenience.

There is an additional value in making this stepline(*np.histogram(x)) syntax possible. There is a number of histogram formats/libraries to name at least two from the HEP side we have ROOT and now https://github.com/scikit-hep/boost-histogram. These don't necessarily talk to each other but typically they all have a to_numpy() fcn, which means you could do stepline(*random_histogram_object.to_numpy()), which I think is super valuable as well - again due to convenience. An added benefit of the allowing bins as an arg/kwarg in the signature stepline(vals, bins=None, *, ....) is that for a lot of use cases I have in mind, the actual choice of the name wouldn't matter.

I have to admit I find np.arange(len(vals)+1) + 0.5 really weird in this context and reminiscent of bar logic - e.g. 5 values and range(1, 6, 1) "bin centers". I would sooner understand np.linspace(0, 1, len(vals)+1)

@andrzejnovak andrzejnovak changed the title feat: HistLine feat: StepPatch Aug 21, 2020
@andrzejnovak
Copy link
Contributor Author

  • Are people firm on not allowing to infer orientation from whether the first or the second arg is longer? We could also dispense with the semantics discussion and simply have y-like, x-like.
  • What's the proper use for tools/boilerplate.py, I tried just executing as is or passing the path to pyplot.py, but it doesn't seem to do its thing.
  • When adding the tests I realized that compared to the step-between as drawstyle [Alternative approach to #15019] #15065 implementation this method doesn't allow to have a variable lower base (shows up when using fill=True) a.k.a y2 of fill_between. I don't think it's necessary for my intended use cases, but worth mentioning.

@timhoffm
Copy link
Member

  • Are people firm on not allowing to infer orientation from whether the first or the second arg is longer? We could also dispense with the semantics discussion and simply have y-like, x-like.

Yes, that's too much magic. Even more so: Our discussion shows that there are arguments for both oderings (values-edges and edges-values). People thinking in the convention not chosen by us would surprisingly get the other orientation.
We already have orientation parameters, so this way is also good in terms of API consistency.

  • What's the proper use for tools/boilerplate.py, I tried just executing as is or passing the path to pyplot.py, but it doesn't seem to do its thing.

You have to manually define the _axes_commands to be mapped to pyplot:

_axes_commands = (

That's ok. I don't think variable baseline is an important use case. One can still later expand the API by supporting an array as baseline parameter.

@jklymak
Copy link
Member

jklymak commented Aug 21, 2020

I think adding a baseline is actually pretty important. You are going to get complaints about zero as soon as someone does set_yscale('log'). And actually, you should add a test for that case...

@andrzejnovak
Copy link
Contributor Author

andrzejnovak commented Aug 21, 2020

I think adding a baseline is actually pretty important. You are going to get complaints about zero as soon as someone does set_yscale('log'). And actually, you should add a test for that case...

I am not sure what you mean. There is a baseline and sticky edges and it seems to work fine with log, the test is there. The question was about having the option of baseline being an array of len(vals).

# The use of the following functions, methods, classes and modules is shown
# in this example:

import matplotlib
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to suppress this flake8 in .flake8

@jklymak
Copy link
Member

jklymak commented Aug 21, 2020

Sorry, I should have read the test more closely - search doesn't work for some reason ;-) And yes, a single value for baseline seems fine. Shouldn't comment before coffee...

Just to be clear, np.histogram uses bins as an argument, but bin_edges as the return value. I still prefer edges to bins as the name for the argument, but would also be fine with bin_edges. But I wouldn't block if there was further support for bins.

@andrzejnovak
Copy link
Contributor Author

Sorry, I should have read the test more closely - search doesn't work for some reason ;-) And yes, a single value for baseline seems fine. Shouldn't comment before coffee...

Just to be clear, np.histogram uses bins as an argument, but bin_edges as the return value. I still prefer edges to bins as the name for the argument, but would also be fine with bin_edges. But I wouldn't block if there was further support for bins.

Hmm interesting, in that case “bin_edges" is also viable option. I've changed the StepPatch kwarg to edges because it makes sense to me, that that object would be decoupled from the histogram meaning.

Copy link
Member

@timhoffm timhoffm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StepLine essentially looks good.

Still have to think about histline(). Do we want that semantically connotated name or do we want stepline()? Will get back to you on this.

@andrzejnovak
Copy link
Contributor Author

I think StepPatch (not StepLine) makes sense.

I was thinking about the plotting fcn name, but I think here it should be possible to go more specialized - i.e. have a histline. There are some important assumptions being made, that would not necessarily be true for a generic step function, like setting sticky_edges or color - it's unlikely one would want to set a histogram face and edge colour simultaneously to different values. A more generic fcn with different behaviour w.r.t kwargs/colors/edges could always be added later on when there is a clear use case.

@andrzejnovak
Copy link
Contributor Author

Any hints on what needs to be changed to pass the docs?

@jklymak
Copy link
Member

jklymak commented Aug 23, 2020

If you click through to the error and then download the full file you can search for "Warning" to finally get:


/home/circleci/project/lib/matplotlib/pyplot.py:docstring of matplotlib.pyplot.histline:18: WARNING: py:obj reference target not found: fill=True
/home/circleci/project/lib/matplotlib/pyplot.py:docstring of matplotlib.pyplot.histline:29: WARNING: py:obj reference target not found: patches.StepPatch
/home/circleci/project/lib/matplotlib/pyplot.py:docstring of matplotlib.pyplot.histline:34: WARNING: py:obj reference target not found: matplotlib.patches.StepPatch
/home/circleci/project/lib/matplotlib/legend_handler.py:docstring of matplotlib.legend_handler.HandlerStepPatch:2: WARNING: py:obj reference target not found: patches.StepPatch
/home/circleci/project/doc/devel/documenting_mpl.rst:1008: WARNING: py:class reference target not found: matplotlib.patches.StepPatch
/home/circleci/project/doc/gallery/lines_bars_and_markers/histline_demo.rst:16: WARNING: py:obj reference target not found: patches.StepPatch
/home/circleci/project/doc/users/next_whats_new/steppatch_and_histline.rst:1: WARNING: py:obj reference target not found: patches.StepPatch

@andrzejnovak
Copy link
Contributor Author

andrzejnovak commented Aug 24, 2020

Yeah, I was able to find that. But I'm not entirely sure why it doesn't work. I've modeled the way I referenced stuff from other mpl scripts, apparently missing some bit that actually makes it work.

Other Parameters
----------------
**kwargs
`~matplotlib.patches.StepPatch` properties
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`~matplotlib.patches.StepPatch` properties
`~.matplotlib.patches.StepPatch` properties

Copy link
Member

@timhoffm timhoffm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! Thank's @andrzejnovak for following through with all our requested changes. This is an extremely useful feature, but we had to make sure we get it right. Thanks all for the API discussions!

While there are still some minor things to decide and improve, this PR has seen enough discussion and is generally in good shape. I suggest to merge this as is and incrementally improve via follow-up PRs. These can then be done by anybody and don't necessarily burden @andrzejnovak, who has already done a great job.

For follow-up PRs I consider the following topics:

  • Comparison of .pyplot.step and .pyplot.stairs could use some explanatory text. addressed in Documentation improvements for stairs() #18543
  • I'm not entirely sure why the default StepPatch example renders with fill color and edges. I thought that default patches render without edge lines. To be checked.
  • Some docstring parameters don't have a description. Even if they state something apparently obvious, they should be explicit. addressed in Documentation improvements for stairs() #18543
  • From an API perspective: Do we need both get/set_data() and get/set_values/edges()? They are redundant. It might be better to reduce the API and either make get/set_data() private or remove get/set_values/edges(). I'm not sure which way is best.

@QuLogic
Copy link
Member

QuLogic commented Sep 15, 2020

I'm not entirely sure why the default StepPatch example renders with fill color and edges. I thought that default patches render without edge lines. To be checked.

Patch defaults to no edges, but PathPatch does not, and StepPatch derives from the latter. Set class attribute _edge_default = False if you want to do the opposite.

Copy link
Member

@efiring efiring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest clarifying an error message before merging, but won't insist on it. Overall, this is a nice addition.

@pytest.mark.xfail
def test_stairs_invalid_update():
h = plt.stairs([1, 2], [0, 1])
h.set_edges(np.arange(5))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you trying to do w/ these tests? They are failing before the set_edges gets run... I've not seen us use xfail tests very often in the test base.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, but yeah the idea is just to make sure stairs fail when the values are invalid.

if orientation == 'vertical':
patch.sticky_edges.y.append(baseline)
else:
patch.sticky_edges.x.append(baseline)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just looking from the peanut gallery...

  1. do you actually want a sticky at y=0 if baseline is None?
  2. adding the sticky could be done as part of the StepPatch implementation (you could even try to keep it in sync in set_baseline, although it's not necessarily clear what to do if the user manually changed the value in the meantime, so... perhaps not). I realize other classes don't do that, although for many of them that's because they're too general for the class to really know what stickies to use (e.g. a Rectangle does not know it's being used as part of a histogram).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 is intended. I think the implied consensus was that the naming would be general but some histogram like assumptions could be made in the plotting function. That's why sticky is in stairs and not in StepPatch. I suppose that means the update of StepPatch is not nicely behaved wrt baseline sticky, but then again neither are lims

@efiring
Copy link
Member

efiring commented Sep 17, 2020

In the spirit of @timhoffm #18275 (review), I'm going to merge this. It looks like the subsequent comments have been addressed.

@efiring efiring merged commit 33f95db into matplotlib:master Sep 17, 2020
@efiring
Copy link
Member

efiring commented Sep 17, 2020

@timhoffm would you like to make an issue based on your list of possible follow-ups? I think that the API question about whether to retain all of the ways of setting values and edges should be decided before release.

@timhoffm
Copy link
Member

Will do. Part of that might directly be a PR.

@ianhi
Copy link
Contributor

ianhi commented Sep 18, 2020

From an API perspective: Do we need both get/set_data() and get/set_values/edges()? They are redundant. It might be better to reduce the API and either make get/set_data() private or remove get/set_values/edges(). I'm not sure which way is best.

Can get/set_data be kept public, or if not can it be made possible to change the number of bins with get/set_values/edges? The ability to change the number of bins is crucial for animations/interactions. #18275 (comment)

@andrzejnovak
Copy link
Contributor Author

@ianhi I think we established that set/get_data needs to be the fundamental method to allow changing number of bins. The discussion seems to be about the auxiliary ones. I've implemented get/set_data, set_values and set_edges, because imo get_data[0] is easy enough to do to justify not having get_values()

@HDembinski
Copy link

HDembinski commented Sep 23, 2020

@timhoffm Sorry to be late to this discussion, but I think the pyplot method should be "stair" instead of "stairs". Similar to "step" which is also not called "steps".

According to https://www.merriam-webster.com/dictionary/stair, a "stair" is already a series of steps.

@QuLogic
Copy link
Member

QuLogic commented Sep 23, 2020

From your link:

1 : a series of steps or flights of steps for passing from one level to another — often used in plural but singular or plural in construction
2 : a single step of a stairway

"stairs" is a weird word in English that is always written in plural, unless you are referring to a single step.

@timhoffm
Copy link
Member

@HDembinski thanks for your feedback. We've extensively discussed the name in the developer call, in particular also stair vs. stairs.

It is acknowledged that most of our plotting function (but not all, c.f. barbs, axhlines) are singular. As @QuLogic has pointed out the usage of singular/plural for stair is particular in English - I assume it has to do with "stairs" being considered a single entity, a grouped set of steps. This legitimates considering to use the plural. Then, search for "stair plot" vs. "stairs plot" on google gives you a popularity ranking of 8M to 70M results, also MATLAB and Octave use stairs. So the plural was finally chosen as the hopefully more common name.

@timhoffm timhoffm mentioned this pull request Sep 25, 2020
@ianhi ianhi mentioned this pull request Oct 2, 2020
7 tasks
@jklymak
Copy link
Member

jklymak commented Nov 24, 2020

This didn't get milestones. Was it 3.4?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature request] Functions for "manually" plotting histograms
10 participants