Skip to content

DOC: normalizing histograms #27426

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 7, 2023

Conversation

jklymak
Copy link
Member

@jklymak jklymak commented Dec 2, 2023

People often seem confused by the density kwarg of hist. I don't think we should change it, but we could document better.

This was similar enough to histogram_features that I removed that and added a redirect.

@jklymak jklymak added the Documentation: examples files in galleries/examples label Dec 2, 2023
Copy link
Member

@story645 story645 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of good information here, but honestly it feels a bit overwhelming. I think adding some headings in the normalizing bins sections highlighting what you're trying to show in each subsection may help anchor the reader & honestly make it more likely they won't just skim over the thing but actually look at the section that's relevant for 'em. ETA: same w. the code honestly -> separating the plotting from the labeling code a bit more might make it easier to single out what functionality is trying to be highlighted here.

Comment on lines 106 to 110
# to make the point very obvious, consider bins that do not have the same
# spacing. By normalizing by density, we preserve the shape of the
# distribution, whereas if we do not, then the wider bins have much higher
# values than the thin bins:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is supposed to be about appropriate bin choice, then separate that out into it's own example w/ a side by side of equally and irregularly spaced bins? Basically I'm reading this trying to visualize the point you're trying to make, so you may as well just visualize it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At minimum, I had to read this a couple of times to understand what was going on because of the way that the top sentence was breaking. I'm honestly still not sure if this is what I'm supposed to parse out of this:

By normalizing by density, we preserve the shape of the
distribution. We emphasize this point using  irregularly spaced bins to show that in the unnormalized example the frequencies are much more sensitive to bin width. 

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a new section before this, as I agree this was too big a leap. Sorry - I was still playing with it, and should have marked as draft.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you did mark it draft - I should have asked if draft meant ready for feedback. I mark draft as an I don't think it's ready to merge, but ready for reviews.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider Draft to mean not ready for review: Draft

Copy link
Member

@story645 story645 Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry and ok, got that for next time. Limit of github that there isn't a third -> ready for feedback but still nascent option :/ ETA: meaning I take that review very literally as ready for final it can be merged on approval review.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to solicit review for a draft PR, but I just use Draft to abuse CI. Maybe a bad habit, but nicer for a server farm to build the docs for me than doing it locally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair - code spaces might also be really good for your use case and faster.

I'm thinking about proposing adding a field to the PR template asking `if draft, would you like feedback?[], anything in particular" b/c I think this might be a fairly common expectation mismatch (half of us are drafts are for early round feedback while messing around and encourage folks to use it as such, half of us are drafts are messing around before feedback please don't touch) and it's one of those things I don't think folks necessarily think to communicate or ask about.

@jklymak jklymak force-pushed the doc-histogram-normalizations branch from 981021d to 3e98181 Compare December 3, 2023 16:54
@jklymak jklymak marked this pull request as draft December 3, 2023 18:20
@jklymak jklymak force-pushed the doc-histogram-normalizations branch 2 times, most recently from e00fe2e to aa52a1d Compare December 4, 2023 03:52
@oscargus
Copy link
Member

oscargus commented Dec 4, 2023

I spent some time reading up on this the other day, so I think this is a valuable addition! (I wanted the probability mass function it turned out.)

@jklymak jklymak force-pushed the doc-histogram-normalizations branch 2 times, most recently from a0905d4 to b1afe65 Compare December 4, 2023 23:34
@jklymak jklymak marked this pull request as ready for review December 5, 2023 00:29
@tacaswell tacaswell added this to the v3.9.0 milestone Dec 5, 2023
# %%
# This normalization can be a little hard to interpret when just exploring the
# data. The value attached to each bar is divided by the total number of data
# points _and_ the width of the bin, and thus the values _integrate_ to one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# points _and_ the width of the bin, and thus the values _integrate_ to one
# points *and* the width of the bin, and thus the values *integrate* to one

we are in rst not md here

Comment on lines 97 to 98
# e.g. (``density = counts / (sum(counts) * np.diff(bins))``),
# and (``np.sum(density * np.diff(bins)) == 1``).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# e.g. (``density = counts / (sum(counts) * np.diff(bins))``),
# and (``np.sum(density * np.diff(bins)) == 1``).
# e.g. ::
#
# density = counts / (sum(counts) * np.diff(bins))
# np.sum(density * np.diff(bins)) == 1

The in-line code snippets are hard to read the way they got wrapped, do as a code block?

Copy link
Member

@tacaswell tacaswell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two small style comments, but this is a significant improvement even without them.

@jklymak can self-merge with or without my suggestions.

@story645
Copy link
Member

story645 commented Dec 5, 2023

The reason I keep hammering on scoping/chunking/headings is the same reason I'm reorganizing the contributing docs so that information is better binned -> otherwise my brain will just gloss over the content even when I'm trying really hard to figure it out because issues with working memory are really common for folks with dyslexia, burnout, or ADHD (👋).

A very very low hanging fruit way to make the docs more accessible is some anchoring through titles and headings. Nothing complicated, just tell folks what they should be focusing on/picking out from a subsection/example. Smaller/cleaner examples would be great too, but subheadings are I don't think a big enough ask to be out of scope & much easier to put in now then on a new cycle. ETA: And I'm specifically making this ask of Jody b/c he's an experienced educator and technical writer - I wouldn't necessarily make this ask in other contexts.

Copy link
Member

@story645 story645 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm requesting changes b/c the .rst up top is over indented and that needs to be fixed.

Comment on lines 12 to 16
- bin the data as you want, either with an automatically chosen number of
bins, or with fixed bin edges,
- normalize the histogram so that its integral is one,
- and assign weights to the data points, so that each data point affects the
count in its bin differently.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- bin the data as you want, either with an automatically chosen number of
bins, or with fixed bin edges,
- normalize the histogram so that its integral is one,
- and assign weights to the data points, so that each data point affects the
count in its bin differently.
- bin the data as you want, either with an automatically chosen number of
bins, or with fixed bin edges,
- normalize the histogram so that its integral is one,
- and assign weights to the data points, so that each data point affects the
count in its bin differently.

fig, ax = plt.subplot_mosaic([['False', 'True']], layout='constrained')
dx = 0.1
xbins = np.arange(-4, 4, dx)
ax['False'].hist(xdata, bins=xbins, density=False, histtype='step', label='Counts')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all the bin width comparisons, it's kind of hard to tell the bin widths from the 'step' type - is there a way to actually show the bins?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With so many bins, vertical lines end up almost merging. Fewer bins, it's hard to see the normal distribution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about stacking as rows instead of columns so you have more horizontal space to work in?

Comment on lines +172 to +180
xbins = np.arange(-4, 4, dx)
# expected histogram:
ax['False'].plot(xpdf, pdf*1000*dx, '--', color=f'C{nn}')
ax['False'].hist(xdata, bins=xbins, density=False, histtype='step')

ax['True'].hist(xdata, bins=xbins, density=True, histtype='step', label=dx)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need both here and all three? only because the busyness makes it feel very cluttered in a way where it's hard to read off the lesson

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main point of why you want to normalize, so comparing and contrasting with and without normalizing is the goal. Multiple bin sizes is to better give the reader an idea of how the bin size affects their results. Sure, it's busy, but I don't think incomprehensibly so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe same as above, stacking horizontally ? or maybe thinner lines + making the histogram colors paler so that it's easier to visually distinguish?

@jklymak
Copy link
Member Author

jklymak commented Dec 6, 2023

The reason I keep hammering on scoping/chunking/headings is the same reason I'm reorganizing the contributing docs so that information is better binned -> otherwise my brain will just gloss over the content even when I'm trying really hard to figure it out because issues with working memory are really common for folks with dyslexia, burnout, or ADHD (👋).

The goal of this example is not to provide a random-access reference. That is already in the API docs. The goal is to explain carefully what the normalization options do so that we can point folks who ask why we normalize histograms the way we do towards this reference for an explanation. I consider this a relatively advanced topic, and I think breaking it into more sections or subsections would be more distracting than helpful.

@story645
Copy link
Member

story645 commented Dec 6, 2023

I consider this a relatively advanced topic

That's mutually exclusive from its accessibility in the universal design context?

I'm not saying make the text accessible to folks who don't have the mathematical background or Python knowledge to parse it, I'm asking that you make it more accessible to folks who's brains may be wired a bit different.

Plenty of writing on advanced topics (and most academic papers) break things out - a great example is 7 Stages in Compositionality, which is intro to category theory

I think breaking it into more sections or subsections would be more distracting than helpful.

Is it distracting for you? Because then it's competing needs and we should figure out if we can find something that works for both of us. The lack of sections is super distracting for me, b/c I don't know where to focus or where one part ends or the other begins. And as someone who frequently links folks to our docs, it's nice when I have a specific place to do so.

Reading through it again, I'm guessing the following, which if I'm correct I don't see the harm in putting this in the document & if I'm wrong it would be helpful to have that table setting in the document.

  1. Choosing Bins
    1. passing in bin edges
    2. passing in number of bins
  2. Normalizing Histograms
    1. density = True and scaling
      1. explanation of integration
      2. use case: preserve shape
      3. use case: compare histograms with different bin sizes
    2. using weights
      1. explaination of pmf
      2. use case: compare histograms with different populations:

ETA: Also I could reverse outline it b/c I've read this document a bunch of times now and reverse outlined it out of frustration b/c I was trying to understand the structure. And it's clearly well structured, and my hypothesis is folks would be more enticed to read it if they could see at a scan from the outline that it's well structured.

Comment on lines 99 to 100
# (``density = counts / (sum(counts) * np.diff(bins))``)
# (``np.sum(density * np.diff(bins)) == 1``).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't compiling correctly.

@jklymak jklymak force-pushed the doc-histogram-normalizations branch from a305177 to f2da1f0 Compare December 6, 2023 02:11
@story645
Copy link
Member

story645 commented Dec 6, 2023

Also I think this is a conceptual tutorial on normalizing histograms far more than a gallery example on how to use the function.

Basically going by the rough distinction on the index page of:

Demo-> plot type and example galleries
Usage->user guide and tutorials

I think this is far more usage than demo.

Screenshot_20231206-012206.png

@jklymak
Copy link
Member Author

jklymak commented Dec 7, 2023

Thanks for the discussion. I'm going to decline to modify this example further at this time. I think the current contribution crosses the bar of being a helpful addition. I'd suggest further changes could be follow-up PRs.

@story645
Copy link
Member

story645 commented Dec 7, 2023

I'd suggest further changes could be follow-up PRs.

If that's true, then why not do it here? There's no urgency to get this in.

ETA: Also why not move it to tutorials, where it'll be easier to find?

story645
story645 previously approved these changes Dec 7, 2023
@story645 story645 dismissed their stale review December 7, 2023 06:19

it compiles so not gonna block

@story645 story645 force-pushed the doc-histogram-normalizations branch from 56b4b24 to b58498d Compare December 7, 2023 06:54
@jklymak
Copy link
Member Author

jklymak commented Dec 7, 2023

@story645 I do not agree with your changes. Is there a reason you are force pushing onto my branch?

@jklymak jklymak force-pushed the doc-histogram-normalizations branch from b58498d to f2da1f0 Compare December 7, 2023 14:35
@story645
Copy link
Member

story645 commented Dec 7, 2023

Accident so I undid it?

You can rebase to drop me from the commit history but I had to rebase to drop - I was using the gh cli & didn't realize it would just push back to yours.

@jklymak
Copy link
Member Author

jklymak commented Dec 7, 2023

I'll self merge based on #27426 (review)

@jklymak jklymak merged commit 01fe735 into matplotlib:main Dec 7, 2023
@jklymak jklymak deleted the doc-histogram-normalizations branch December 7, 2023 15:00
@tacaswell
Copy link
Member

Reviewing this was on my todo list for yesterday and this morning, sorry I did not get to it in time.

@jklymak
Copy link
Member Author

jklymak commented Dec 7, 2023

@tacaswell, it didn't change from your previous review except for a typo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation: examples files in galleries/examples
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants