DOC: normalizing histograms #27426

jklymak · 2023-12-02T22:39:40Z

People often seem confused by the density kwarg of hist. I don't think we should change it, but we could document better.

This was similar enough to histogram_features that I removed that and added a redirect.

story645

There's a lot of good information here, but honestly it feels a bit overwhelming. I think adding some headings in the normalizing bins sections, highlighting what you're trying to show in each subsection, may help anchor the reader and make it more likely they won't just skim over the thing but will actually look at the section that's relevant for them. ETA: same with the code, honestly: separating the plotting from the labeling code a bit more might make it easier to single out which functionality is being highlighted here.

story645 · 2023-12-02T23:34:51Z

galleries/examples/statistics/histogram_normalization.py

+# to make the point very obvious, consider bins that do not have the same
+# spacing.  By normalizing by density, we preserve the shape of the
+# distribution, whereas if we do not, then the wider bins have much higher
+# values than the thin bins:


If this is supposed to be about appropriate bin choice, then separate that out into its own example with a side-by-side of equally and irregularly spaced bins? Basically, I'm reading this trying to visualize the point you're trying to make, so you may as well just visualize it?

At minimum, I had to read this a couple of times to understand what was going on because of the way that the top sentence was breaking. I'm honestly still not sure if this is what I'm supposed to parse out of this:

By normalizing by density, we preserve the shape of the distribution. We emphasize this point using irregularly spaced bins to show that in the unnormalized example the frequencies are much more sensitive to bin width.
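As a rough illustration of the bin-width point (a sketch, not the code under review): the data below are a deterministic, evenly spaced stand-in for a uniform sample, and the bin edges are made up for the illustration. It uses `np.histogram` directly, with the same `bins` and `density` arguments that `hist` accepts.

```python
import numpy as np

# Hypothetical data: 1000 evenly spaced values in (0, 1), standing in for a
# uniform sample so that the result is exact rather than noisy.
xdata = (np.arange(1000) + 0.5) / 1000

# Hypothetical irregular bin edges: two thin bins, then two wide ones.
xbins = np.array([0.0, 0.1, 0.2, 0.5, 1.0])

counts, _ = np.histogram(xdata, bins=xbins)
density, _ = np.histogram(xdata, bins=xbins, density=True)

print(counts)   # the wide bins collect more points than the thin ones
print(density)  # dividing by bin width recovers the flat distribution
```

Under these assumptions the raw counts scale with bin width (100, 100, 300, 500), while the density is flat at one, which is the shape of the underlying distribution.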

I've added a new section before this, as I agree this was too big a leap. Sorry - I was still playing with it, and should have marked as draft.

No, you did mark it draft - I should have asked if draft meant ready for feedback. I use draft to mean "I don't think it's ready to merge, but it's ready for reviews."

I consider Draft to mean not ready for review:

Sorry, and OK, got that for next time. It's a limitation of GitHub that there isn't a third option: ready for feedback but still nascent :/ ETA: meaning I take "ready for review" very literally, as ready for a final review where it can be merged on approval.

I think it's fine to solicit review for a draft PR, but I just use Draft to abuse CI. Maybe a bad habit, but nicer for a server farm to build the docs for me than doing it locally.

That's fair - Codespaces might also be really good for your use case, and faster.

I'm thinking about proposing adding a field to the PR template asking "if draft, would you like feedback? [], anything in particular?" b/c I think this might be a fairly common expectation mismatch (half of us think drafts are for early-round feedback while messing around, and encourage folks to use them as such; half of us think drafts are for messing around before feedback, please don't touch), and it's one of those things I don't think folks necessarily think to communicate or ask about.

galleries/examples/statistics/histogram_normalization.py

oscargus · 2023-12-04T09:30:47Z

I spent some time reading up on this the other day, so I think this is a valuable addition! (I wanted the probability mass function it turned out.)
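For readers who, like the comment above, want a probability mass function rather than a density, one common approach is to weight every data point by 1/N. The following is a minimal sketch using `np.histogram`; the seed, sample size, and number of bins are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(19680801)  # arbitrary seed
xdata = rng.normal(size=1000)

# Give every data point a weight of 1/N, so each bar is the fraction of
# points that fall in its bin.
weights = np.ones(len(xdata)) / len(xdata)
pmf, edges = np.histogram(xdata, bins=20, weights=weights)
density, _ = np.histogram(xdata, bins=20, density=True)

print(np.isclose(pmf.sum(), 1.0))                  # bar heights sum to one
print(np.allclose(pmf, density * np.diff(edges)))  # pmf = density * bin width
```

The same `weights` array can be passed to `hist`. Note that the bar heights sum to one here, whereas with `density=True` it is the bar areas that sum to one.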

jklymak · 2023-12-05T00:29:10Z

https://output.circle-artifacts.com/output/job/bd0c26ea-c7d9-4cb5-8f9b-f2c8923574c7/artifacts/0/doc/build/html/gallery/statistics/histogram_normalization.html#sphx-glr-gallery-statistics-histogram-normalization-py

tacaswell · 2023-12-05T21:12:49Z

galleries/examples/statistics/histogram_normalization.py

+# %%
+# This normalization can be a little hard to interpret when just exploring the
+# data. The value attached to each bar is divided by the total number of data
+# points _and_ the width of the bin, and thus the values _integrate_ to one


Suggested change

-# points _and_ the width of the bin, and thus the values _integrate_ to one
+# points *and* the width of the bin, and thus the values *integrate* to one

we are in rst not md here

tacaswell · 2023-12-05T21:16:01Z

galleries/examples/statistics/histogram_normalization.py

+# e.g. (``density = counts / (sum(counts) * np.diff(bins))``),
+# and (``np.sum(density * np.diff(bins)) == 1``).


Suggested change

-# e.g. (``density = counts / (sum(counts) * np.diff(bins))``),
-# and (``np.sum(density * np.diff(bins)) == 1``).
+# e.g. ::
+#
+#    density = counts / (sum(counts) * np.diff(bins))
+#    np.sum(density * np.diff(bins)) == 1

The in-line code snippets are hard to read the way they got wrapped, do as a code block?
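For reference, the two relations in the suggested code block can be checked directly with `np.histogram`. This is a minimal sketch; the seed, sample size, and bin edges are arbitrary choices for illustration and are not taken from the example under review.

```python
import numpy as np

rng = np.random.default_rng(19680801)  # arbitrary seed
xdata = rng.normal(size=1000)
bins = np.arange(-4, 4.5, 0.5)  # arbitrary equal-width bin edges

counts, _ = np.histogram(xdata, bins=bins)
density = counts / (sum(counts) * np.diff(bins))

# The hand-computed density matches what density=True returns ...
expected, _ = np.histogram(xdata, bins=bins, density=True)
print(np.allclose(density, expected))

# ... and the bar areas (height times width) add up to one.
print(np.isclose(np.sum(density * np.diff(bins)), 1.0))
```

The sketch compares with `np.isclose` rather than `== 1`, since the sum is a floating-point result and may differ from one by rounding error.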

tacaswell

I left two small style comments, but this is a significant improvement even without them.

@jklymak can self-merge with or without my suggestions.

story645 · 2023-12-05T22:31:03Z

The reason I keep hammering on scoping/chunking/headings is the same reason I'm reorganizing the contributing docs so that information is better binned -> otherwise my brain will just gloss over the content even when I'm trying really hard to figure it out because issues with working memory are really common for folks with dyslexia, burnout, or ADHD (👋).

A very, very low-hanging-fruit way to make the docs more accessible is some anchoring through titles and headings. Nothing complicated, just tell folks what they should be focusing on or picking out from a subsection/example. Smaller/cleaner examples would be great too, but I don't think subheadings are a big enough ask to be out of scope, and they're much easier to put in now than in a new cycle. ETA: And I'm specifically making this ask of Jody b/c he's an experienced educator and technical writer; I wouldn't necessarily make this ask in other contexts.

story645

I'm requesting changes b/c the .rst up top is over-indented and that needs to be fixed.

story645 · 2023-12-05T22:33:10Z

galleries/examples/statistics/histogram_normalization.py

+  - bin the data as you want, either with an automatically chosen number of
+    bins, or with fixed bin edges,
+  - normalize the histogram so that its integral is one,
+  - and assign weights to the data points, so that each data point affects the
+    count in its bin differently.


Suggested change

-  - bin the data as you want, either with an automatically chosen number of
-    bins, or with fixed bin edges,
-  - normalize the histogram so that its integral is one,
-  - and assign weights to the data points, so that each data point affects the
-    count in its bin differently.
+- bin the data as you want, either with an automatically chosen number of
+  bins, or with fixed bin edges,
+- normalize the histogram so that its integral is one,
+- and assign weights to the data points, so that each data point affects the
+  count in its bin differently.

story645 · 2023-12-05T22:40:04Z

galleries/examples/statistics/histogram_normalization.py

+fig, ax = plt.subplot_mosaic([['False', 'True']], layout='constrained')
+dx = 0.1
+xbins = np.arange(-4, 4, dx)
+ax['False'].hist(xdata, bins=xbins, density=False, histtype='step', label='Counts')


For all the bin width comparisons, it's kind of hard to tell the bin widths from the 'step' type - is there a way to actually show the bins?

With so many bins, vertical lines end up almost merging. With fewer bins, it's hard to see the normal distribution.

What about stacking as rows instead of columns so you have more horizontal space to work in?

story645 · 2023-12-05T22:41:03Z

galleries/examples/statistics/histogram_normalization.py

+    xbins = np.arange(-4, 4, dx)
+    # expected histogram:
+    ax['False'].plot(xpdf, pdf*1000*dx, '--', color=f'C{nn}')
+    ax['False'].hist(xdata, bins=xbins, density=False, histtype='step')
+
+    ax['True'].hist(xdata, bins=xbins, density=True, histtype='step', label=dx)
+


Do you need both here, and all three? Only because the busyness makes it feel very cluttered, in a way where it's hard to read off the lesson.

This is the main point of why you want to normalize, so comparing and contrasting with and without normalizing is the goal. Multiple bin sizes is to better give the reader an idea of how the bin size affects their results. Sure, it's busy, but I don't think incomprehensibly so.

Maybe the same as above: stacking horizontally? Or maybe thinner lines plus paler histogram colors, so that it's easier to visually distinguish them?
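The contrast being discussed in this thread can also be seen numerically. In the diff above, the dashed curve `pdf*1000*dx` is the expected count per bin (number of samples times bin width times the pdf), so unnormalized bar heights grow with `dx`, while density heights should stay near the pdf. A minimal sketch, with an arbitrary seed and hypothetical bin widths:

```python
import numpy as np

rng = np.random.default_rng(19680801)  # arbitrary seed
xdata = rng.normal(size=1000)

peak_counts = []
peak_density = []
for dx in [0.1, 0.4, 1.2]:  # hypothetical bin widths
    xbins = np.arange(-4, 4, dx)
    counts, _ = np.histogram(xdata, bins=xbins)
    density, _ = np.histogram(xdata, bins=xbins, density=True)
    peak_counts.append(counts.max())
    peak_density.append(density.max())
    print(f"dx={dx}: peak count {counts.max()}, "
          f"peak density {density.max():.2f}")
```

The peak count rises with the bin width, whereas the peak density stays in the neighborhood of the standard normal pdf maximum (about 0.4), up to sampling noise for the narrow bins and smoothing for the wide ones.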

jklymak · 2023-12-06T01:12:58Z

> The reason I keep hammering on scoping/chunking/headings is the same reason I'm reorganizing the contributing docs so that information is better binned -> otherwise my brain will just gloss over the content even when I'm trying really hard to figure it out because issues with working memory are really common for folks with dyslexia, burnout, or ADHD (👋).

The goal of this example is not to provide a random-access reference. That is already in the API docs. The goal is to explain carefully what the normalization options do so that we can point folks who ask why we normalize histograms the way we do towards this reference for an explanation. I consider this a relatively advanced topic, and I think breaking it into more sections or subsections would be more distracting than helpful.

story645 · 2023-12-06T02:05:32Z

> I consider this a relatively advanced topic

That's mutually exclusive from its accessibility in the universal design context?

I'm not saying make the text accessible to folks who don't have the mathematical background or Python knowledge to parse it, I'm asking that you make it more accessible to folks whose brains may be wired a bit differently.

Plenty of writing on advanced topics (and most academic papers) breaks things out - a great example is Seven Sketches in Compositionality, which is an intro to category theory.

> I think breaking it into more sections or subsections would be more distracting than helpful.

Is it distracting for you? Because then it's competing needs and we should figure out if we can find something that works for both of us. The lack of sections is super distracting for me, b/c I don't know where to focus or where one part ends or the other begins. And as someone who frequently links folks to our docs, it's nice when I have a specific place to do so.

Reading through it again, I'm guessing the outline is the following. If I'm correct, I don't see the harm in putting it in the document, and if I'm wrong, it would be helpful to have that table setting in the document.

Choosing Bins
1. passing in bin edges
2. passing in number of bins

Normalizing Histograms
1. density = True and scaling
   1. explanation of integration
   2. use case: preserve shape
   3. use case: compare histograms with different bin sizes
2. using weights
   1. explanation of pmf
   2. use case: compare histograms with different populations

ETA: Also I could reverse outline it b/c I've read this document a bunch of times now and reverse outlined it out of frustration b/c I was trying to understand the structure. And it's clearly well structured, and my hypothesis is folks would be more enticed to read it if they could see at a scan from the outline that it's well structured.

story645 · 2023-12-06T02:06:58Z

galleries/examples/statistics/histogram_normalization.py

+#     (``density = counts / (sum(counts) * np.diff(bins))``)
+#     (``np.sum(density * np.diff(bins)) == 1``).


this isn't compiling correctly.

story645 · 2023-12-06T06:12:24Z

Also I think this is a conceptual tutorial on normalizing histograms far more than a gallery example on how to use the function.

Basically going by the rough distinction on the index page of:

Demo-> plot type and example galleries
Usage->user guide and tutorials

I think this is far more usage than demo.

jklymak · 2023-12-07T03:56:04Z

Thanks for the discussion. I'm going to decline to modify this example further at this time. I think the current contribution crosses the bar of being a helpful addition. I'd suggest further changes could be follow-up PRs.

story645 · 2023-12-07T04:38:14Z

> I'd suggest further changes could be follow-up PRs.

If that's true, then why not do it here? There's no urgency to get this in.

ETA: Also why not move it to tutorials, where it'll be easier to find?

it compiles so not gonna block

jklymak · 2023-12-07T14:35:16Z

@story645 I do not agree with your changes. Is there a reason you are force pushing onto my branch?

story645 · 2023-12-07T14:35:56Z

Accident so I undid it?

You can rebase to drop me from the commit history but I had to rebase to drop - I was using the gh cli & didn't realize it would just push back to yours.

jklymak · 2023-12-07T15:00:24Z

I'll self merge based on #27426 (review)

tacaswell · 2023-12-07T15:04:17Z

Reviewing this was on my todo list for yesterday and this morning, sorry I did not get to it in time.

jklymak · 2023-12-07T15:19:11Z

@tacaswell, it didn't change from your previous review except for a typo.

jklymak added the Documentation: examples files in galleries/examples label Dec 2, 2023

story645 reviewed Dec 2, 2023

jklymak force-pushed the doc-histogram-normalizations branch from 981021d to 3e98181 Compare December 3, 2023 16:54

jklymak marked this pull request as draft December 3, 2023 18:20

jklymak force-pushed the doc-histogram-normalizations branch 2 times, most recently from e00fe2e to aa52a1d Compare December 4, 2023 03:52

oscargus reviewed Dec 4, 2023

oscargus reviewed Dec 4, 2023

jklymak force-pushed the doc-histogram-normalizations branch 2 times, most recently from a0905d4 to b1afe65 Compare December 4, 2023 23:34

jklymak marked this pull request as ready for review December 5, 2023 00:29

tacaswell added this to the v3.9.0 milestone Dec 5, 2023

tacaswell reviewed Dec 5, 2023

tacaswell approved these changes Dec 5, 2023

story645 requested changes Dec 5, 2023

story645 reviewed Dec 6, 2023

DOC: normalizing histograms

f2da1f0

jklymak force-pushed the doc-histogram-normalizations branch from a305177 to f2da1f0 Compare December 6, 2023 02:11

story645 previously approved these changes Dec 7, 2023

story645 force-pushed the doc-histogram-normalizations branch from 56b4b24 to b58498d Compare December 7, 2023 06:54

story645 mentioned this pull request Dec 7, 2023

Doc: follow up on histogram normalization example #27459 (Draft)

jklymak force-pushed the doc-histogram-normalizations branch from b58498d to f2da1f0 Compare December 7, 2023 14:35

jklymak merged commit 01fe735 into matplotlib:main Dec 7, 2023

jklymak deleted the doc-histogram-normalizations branch December 7, 2023 15:00

jklymak mentioned this pull request Jan 5, 2024

[DOC]: usage docs content guidelines #26389 (Draft)
