-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
DOC: normalizing histograms #27426
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: normalizing histograms #27426
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of good information here, but honestly it feels a bit overwhelming. I think adding some headings in the normalizing bins sections highlighting what you're trying to show in each subsection may help anchor the reader & honestly make it more likely they won't just skim over the thing but actually look at the section that's relevant for 'em. ETA: same w. the code honestly -> separating the plotting from the labeling code a bit more might make it easier to single out what functionality is trying to be highlighted here.
# to make the point very obvious, consider bins that do not have the same | ||
# spacing. By normalizing by density, we preserve the shape of the | ||
# distribution, whereas if we do not, then the wider bins have much higher | ||
# values than the thin bins: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is supposed to be about appropriate bin choice, then separate that out into it's own example w/ a side by side of equally and irregularly spaced bins? Basically I'm reading this trying to visualize the point you're trying to make, so you may as well just visualize it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At minimum, I had to read this a couple of times to understand what was going on because of the way that the top sentence was breaking. I'm honestly still not sure if this is what I'm supposed to parse out of this:
By normalizing by density, we preserve the shape of the
distribution. We emphasize this point using irregularly spaced bins to show that in the unnormalized example the frequencies are much more sensitive to bin width.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a new section before this, as I agree this was too big a leap. Sorry - I was still playing with it, and should have marked as draft.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you did mark it draft - I should have asked if draft meant ready for feedback. I mark draft as an I don't think it's ready to merge, but ready for reviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry and ok, got that for next time. Limit of github that there isn't a third -> ready for feedback but still nascent option :/ ETA: meaning I take that review very literally as ready for final it can be merged on approval review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to solicit review for a draft PR, but I just use Draft to abuse CI. Maybe a bad habit, but nicer for a server farm to build the docs for me than doing it locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair - code spaces might also be really good for your use case and faster.
I'm thinking about proposing adding a field to the PR template asking `if draft, would you like feedback?[], anything in particular" b/c I think this might be a fairly common expectation mismatch (half of us are drafts are for early round feedback while messing around and encourage folks to use it as such, half of us are drafts are messing around before feedback please don't touch) and it's one of those things I don't think folks necessarily think to communicate or ask about.
981021d
to
3e98181
Compare
e00fe2e
to
aa52a1d
Compare
I spent some time reading up on this the other day, so I think this is a valuable addition! (I wanted the probability mass function it turned out.) |
a0905d4
to
b1afe65
Compare
# %% | ||
# This normalization can be a little hard to interpret when just exploring the | ||
# data. The value attached to each bar is divided by the total number of data | ||
# points _and_ the width of the bin, and thus the values _integrate_ to one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# points _and_ the width of the bin, and thus the values _integrate_ to one | |
# points *and* the width of the bin, and thus the values *integrate* to one |
we are in rst
not md
here
# e.g. (``density = counts / (sum(counts) * np.diff(bins))``), | ||
# and (``np.sum(density * np.diff(bins)) == 1``). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# e.g. (``density = counts / (sum(counts) * np.diff(bins))``), | |
# and (``np.sum(density * np.diff(bins)) == 1``). | |
# e.g. :: | |
# | |
# density = counts / (sum(counts) * np.diff(bins)) | |
# np.sum(density * np.diff(bins)) == 1 |
The in-line code snippets are hard to read the way they got wrapped, do as a code block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left two small style comments, but this is a significant improvement even without them.
@jklymak can self-merge with or without my suggestions.
The reason I keep hammering on scoping/chunking/headings is the same reason I'm reorganizing the contributing docs so that information is better binned -> otherwise my brain will just gloss over the content even when I'm trying really hard to figure it out because issues with working memory are really common for folks with dyslexia, burnout, or ADHD (👋). A very very low hanging fruit way to make the docs more accessible is some anchoring through titles and headings. Nothing complicated, just tell folks what they should be focusing on/picking out from a subsection/example. Smaller/cleaner examples would be great too, but subheadings are I don't think a big enough ask to be out of scope & much easier to put in now then on a new cycle. ETA: And I'm specifically making this ask of Jody b/c he's an experienced educator and technical writer - I wouldn't necessarily make this ask in other contexts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm requesting changes b/c the .rst up top is over indented and that needs to be fixed.
- bin the data as you want, either with an automatically chosen number of | ||
bins, or with fixed bin edges, | ||
- normalize the histogram so that its integral is one, | ||
- and assign weights to the data points, so that each data point affects the | ||
count in its bin differently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- bin the data as you want, either with an automatically chosen number of | |
bins, or with fixed bin edges, | |
- normalize the histogram so that its integral is one, | |
- and assign weights to the data points, so that each data point affects the | |
count in its bin differently. | |
- bin the data as you want, either with an automatically chosen number of | |
bins, or with fixed bin edges, | |
- normalize the histogram so that its integral is one, | |
- and assign weights to the data points, so that each data point affects the | |
count in its bin differently. |
fig, ax = plt.subplot_mosaic([['False', 'True']], layout='constrained') | ||
dx = 0.1 | ||
xbins = np.arange(-4, 4, dx) | ||
ax['False'].hist(xdata, bins=xbins, density=False, histtype='step', label='Counts') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all the bin width comparisons, it's kind of hard to tell the bin widths from the 'step' type - is there a way to actually show the bins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With so many bins, vertical lines end up almost merging. Fewer bins, it's hard to see the normal distribution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about stacking as rows instead of columns so you have more horizontal space to work in?
xbins = np.arange(-4, 4, dx) | ||
# expected histogram: | ||
ax['False'].plot(xpdf, pdf*1000*dx, '--', color=f'C{nn}') | ||
ax['False'].hist(xdata, bins=xbins, density=False, histtype='step') | ||
|
||
ax['True'].hist(xdata, bins=xbins, density=True, histtype='step', label=dx) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need both here and all three? only because the busyness makes it feel very cluttered in a way where it's hard to read off the lesson
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main point of why you want to normalize, so comparing and contrasting with and without normalizing is the goal. Multiple bin sizes is to better give the reader an idea of how the bin size affects their results. Sure, it's busy, but I don't think incomprehensibly so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe same as above, stacking horizontally ? or maybe thinner lines + making the histogram colors paler so that it's easier to visually distinguish?
The goal of this example is not to provide a random-access reference. That is already in the API docs. The goal is to explain carefully what the normalization options do so that we can point folks who ask why we normalize histograms the way we do towards this reference for an explanation. I consider this a relatively advanced topic, and I think breaking it into more sections or subsections would be more distracting than helpful. |
That's mutually exclusive from its accessibility in the universal design context? I'm not saying make the text accessible to folks who don't have the mathematical background or Python knowledge to parse it, I'm asking that you make it more accessible to folks who's brains may be wired a bit different. Plenty of writing on advanced topics (and most academic papers) break things out - a great example is 7 Stages in Compositionality, which is intro to category theory
Is it distracting for you? Because then it's competing needs and we should figure out if we can find something that works for both of us. The lack of sections is super distracting for me, b/c I don't know where to focus or where one part ends or the other begins. And as someone who frequently links folks to our docs, it's nice when I have a specific place to do so. Reading through it again, I'm guessing the following, which if I'm correct I don't see the harm in putting this in the document & if I'm wrong it would be helpful to have that table setting in the document.
ETA: Also I could reverse outline it b/c I've read this document a bunch of times now and reverse outlined it out of frustration b/c I was trying to understand the structure. And it's clearly well structured, and my hypothesis is folks would be more enticed to read it if they could see at a scan from the outline that it's well structured. |
# (``density = counts / (sum(counts) * np.diff(bins))``) | ||
# (``np.sum(density * np.diff(bins)) == 1``). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't compiling correctly.
a305177
to
f2da1f0
Compare
Also I think this is a conceptual tutorial on normalizing histograms far more than a gallery example on how to use the function. Basically going by the rough distinction on the index page of: Demo-> plot type and example galleries I think this is far more usage than demo. |
Thanks for the discussion. I'm going to decline to modify this example further at this time. I think the current contribution crosses the bar of being a helpful addition. I'd suggest further changes could be follow-up PRs. |
If that's true, then why not do it here? There's no urgency to get this in. ETA: Also why not move it to tutorials, where it'll be easier to find? |
56b4b24
to
b58498d
Compare
@story645 I do not agree with your changes. Is there a reason you are force pushing onto my branch? |
b58498d
to
f2da1f0
Compare
Accident so I undid it? You can rebase to drop me from the commit history but I had to rebase to drop - I was using the gh cli & didn't realize it would just push back to yours. |
I'll self merge based on #27426 (review) |
Reviewing this was on my todo list for yesterday and this morning, sorry I did not get to it in time. |
@tacaswell, it didn't change from your previous review except for a typo. |
People often seem confused by the density kwarg of
hist
. I don't think we should change it, but we could document better.This was similar enough to
histogram_features
that I removed that and added a redirect.