-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Doc: follow up on histogram normalization example #27459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
56b4b24
to
9fdb0cd
Compare
aec07ce
to
de7b768
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The figure width comment applies to all figures, but I did not feel like commenting on all of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, submitted a bit early.
I also think using 'False'
/'True'
instead of False
/True
is a bit weird, but that's not new here.
ax.hist(xdata, bins=xbins, weights=1/len(xdata) * np.ones(len(xdata)), | ||
histtype='step', label=f'{dx}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be equivalent:
ax.hist(xdata, bins=xbins, weights=1/len(xdata) * np.ones(len(xdata)), | |
histtype='step', label=f'{dx}') | |
ax.hist(xdata, bins=xbins, weights=np.full_like(xdata, 1 / len(xdata)), | |
histtype='step', label=f'{dx}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think np.ones(N)/N
is also equivalent, but I thought the goal was to make it clear that the scale factor here is 1/N
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1/N is still shown in my suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but I don't see how your suggestion is clearer since full_like isn't a common method (also when I tried it I got all 0s since I didn't do 1./N)
1d65522
to
e9c7d92
Compare
ToDo: I wanna pull out the one line that's the histogram before or after the code block to more clearly isolate/explain how to do the things. - done |
e9c7d92
to
165a3f2
Compare
I have several substantial concerns about this. The bulk of these changes were proposed and rejected as part of the discussion in #27426. I don't see any discussion of what is different now. If this content is moved, it should probably be moved to "exp}lain" not "tutorial" as it is (was) an explanation of what each of the (many) imports to While discussion of cognitive load is reasonable, I have concerns with it being used as the primary justification of these changes. While in some cases increasing the structure can help, introducing the wrong structure can also make it worse. Previously the two top level sections were "I want to change the bin edges" (which is only |
# weights | ||
# ======= | ||
# | ||
# Sometimes people want to normalize so that the sum of counts is one. This is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is underselling why the weight
keyword exists. If we are going to go with this re-organization the subsections need to be individually complete and correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked at all the changes here, but agreed that the point of the example was not to explain all the uses of the weight parameter, but rather to point out that it can be used to get the effect of normalizing by N which people sometimes want.
As linked to at the very top of this PR, @jklymak said to put these changes in a follow up PR:
The reason for separating them is that the Matplotlib library provides them as separate independent keywords so someone scanning this doc can see "oh density, and here's an explanation", "oh weights, here's an explanation" -> that separation enforces that these are distinct parameters used to achieve distinct goals & that's why the API provides both as distinct entities.
The explanations of what weights does and what density does are still under the respective parameter subheadings, so can you please explain what you mean here? Frankly I think the "why I want to change the meaning of the bin values" is completely out of scope for the Matplotlib docs because that's data/stats, but I can see the value in it so I left it as case studies. I'll flesh out that intro to that section/change the title to make that clearer, but I think for the Matplotlib docs we want to separate the "how does it work" from the "why do I want to do this" b/c the why is different for every user and many users know their why but still need to know our how. ETA: also I think it's a good idea to put back a higher level "computing bin values" before the density and weights section and I'll put that back. |
I could see that, in that the example it's most similar to is https://matplotlib.org/devdocs/users/explain/artists/imshow_extent.html and I don't mind it moving there. But also, the rewrite is in line with the diataxis 'tutorial' style
it works through the parameters one by one building them up, with repetition and concrete images at every step and It breaks up the explanations into very small chunks paired with immediate concrete examples. Even the pulling out the code before the larger examples is to prime the reader to see it in the larger example-> again repetition + small steps. (ETA: it's even written in the we and that was @jklymak ) Which, a related but separate discussion is untangling tutorials from user guide because currently they're explicitly intermixed in a way where it's very unclear which one a user should go to. |
165a3f2
to
215411a
Compare
The original goal of this example was to give people an idea how to make certain types of histograms, not explain the API, which is what the API docs should do. In particular, the original goal was to document how you get the probability mass function, since people are often confused that is not what An example of using weights for other uses would probably be fine, since the scope of the example broadened somewhat as it was composed. I don't think this belongs in Tutorials. Having Tutorials creep back into being a catchall for things we want to explain would be an organizational step backwards. I think it belongs in Examples, because we do not have in-depth examples like this in the Users guide, and adding a bunch of very specialized ones like this would be hard to do in a balanced fashion. |
I definitely don't think it belongs in the user guide - but I also don't think that the imshow guide belongs in the user guide either. You said yourself elsewhere that you think of the gallery as short stack overflow style demos -> which an explanation of histogram normalization is not.
What I'm proposing is tutorials be the place for long form explanatory tutorials of how to use our plotting methods...which yes means pulling some things back from the user guide. When I'm thinking clean separation and scope, my mind goes to:
Granted, what I really would like is to maybe do this sorting out as another GSOD, but there's no way it'll be successful if we can't compromise somehow. I think for the most part the bulk of the work ends up being cleaning up framing so that it's consistent within a section, with some moving around based on agreed upon content guidelines.
I think it's nice to have docs that show how all the API parameters work in one place/how to use those parameters (this is a thing folks ask for a lot - how do I use pie) & that in the API docs this kinda step by step w/ pictures would be majorly distracting/annoying for folks who already know how to use hist & just want to double check an argument - which that's the instructive vs reference distinction.
Then it should have been scoped to that, but the scope grew and it became a tutorial on histogram binning. Fundementally what I'm aiming at here is reframing this document so that it's primarily focused on library functionality. I was very careful to not remove any of the content you wrote. |
I don't agree with that as an endpoint for the user guide. The user guide should be how to use Matplotlib, as completely as we can manage. To me, that eventually would include "how do I use the histogram method". Tutorials should be like the "Lifecycle of a plot" variety, or other online tutorials that take the reader on a complete guided journey. Examples should be standalone, and relatively short. This particular example is long, so I agree it is borderline. An advantage to leaving it in "examples" is that it puts all the
Not clear that is a problem - you need to understand the binning if you were going to understand the density arguments because they interact so strongly. I mean I guess there could have been two separate examples, but that doesn't seem necessary |
Yeah, I'm proposing complete guided journeys of the plotting methods. The feedback I've consistently heard on the lifecycle tutorial is that we ruined what made it good & now nobody understands what it's for. And what do you mean by other online tutorials? What I'm struggling with is what would be the paragraph at the top of "user guide" and on top of "tutorial" telling the user what they could expect from each page. Granted, I also wouldn't mind getting rid of tutorials all together if we can't come to a consensus on its purpose.
We can also use tags to do that.
Sort of the intent behind restructuring in this way - it provides a natural way to slot in the other hist parameters.
Two separate examples means that there's a binning example that we could link to/cross reference across every example where it may be helpful to know about binning. If I was gonna draw an n-level deep toctree of the docs, the implication would be that binning is a topic that's important only for normalization because that's where it's explained - a big motivator for a lot of what I'm proposing is to try and prevent users from conflating topics. ETA: one of my major meta motivations is that I think it's much easier to see what docs are missing if we've got our docs well organized. And to organize our docs, we have to collectively decide on what we think should go where. ETA 2: Granted, I also think we've had so many back and forths on this and are at a total impasse & it's probably more constructive to hash this out on a call in January at this point. |
Also, just for the record, I'm not really married to the exact user guide/tutorial bins that I'm proposing here - @melissawm and @esibinga weren't super keen on my exact division either. I'm concerned about putting plotting function tutorials in user guide b/c a) I think we agree that they should be written as tutorials even if we disagree on what that means, b) we've got 60 odd methods, 30 or so in the plot type gallery, and that's a lot of content to add the guide c) my meta proposal is that the guide cover the topics that apply to an arbitrary plot -> figure/axes/formatting/etc - while plotting method usage is guided by individual needs (for example, I've never needed to make a contour plot, but yes under this definition something like the imshow guide would stay under user guide but expanded to make clear it applies to all the AxesImage plot types, which # ties back to underlying object as the organizing principal). My priority is that each section of the docs have a clear distinct identity, scope, & pedagogical approach (ETA: which is what the diataxes approach is fundementally advocating) so that we can sort documents, improve cohesiveness within a section, and reduce duplication/redundancy. The motivation for this PR was that the contents of the document changed scope within the original (pulling back to being the histogram features example this replaced) & this attempts to update the structure to match that change in scope. Functionally, this is what I've been doing with basically all my doc PRs lately - sorting and binning to make the distribution of docs/topics/content cleaner. ETA: This is b/c I had thought that there was long standing consensus that the major issue w/ the docs is the information architecture, so I was approaching this PR w/ that in mind. I mostly do not have major concerns w/ the original PR as a standalone document, but within the context of the rest of the documentation I think it is too much a tutorial on histogram to be an example (bins don't need to be explained for a demo of using hist for norming) but as a tutorial is too focused on the statistics/why use histograms and not enough on library functionality and I'm fairly certain our goal is not to recreate https://www.itl.nist.gov/div898/handbook/eda/section3/histogra.htm |
215411a
to
ca4675e
Compare
Pull tutorial content out of example and into standalone tutorial. Co-authored-by: Elliott Sales de Andrade <quantum.analyst@gmail.com>
ca4675e
to
a15c661
Compare
makes more sense to rework this w/ consolidating the other histogram examples in #27569 so shifting back to draft. |
PR summary
Per #27426 (comment), PR aimed at reducing some of the cognitive load required to understand this example by adding:
As mentioned in #27426 (comment), reducing cognitive load makes the example more accessible (in a DEI way), which is in keeping with our CoC and mission. Universal design principals are that material that is more accessible tends to lead to better learning outcomes even for folks who don't need the accommodation.
I also moved it to tutorials because on https://matplotlib.org/devdocs/ we've started making the distinction that
This document is very much focused on why and how normalization works and when to use it rather than a demonstration of features (though this PR revision emphasizes the parameters) and I think the more we start scoping things the easier it'll be to start building identities around each section of the docs, which hopefully will help w/ discoverability.
-->
PR checklist