DOC: quantile: correct/simplify documentation #25704

mdhaber · 2024-01-27T10:51:42Z

gh-24588 noted inaccuracies in the quantile documentation. This PR corrects these and simplifies the descriptions of the various methods. It also refers users of percentile, nanquantile, and nanpercentile to the documentation of quantile to reduce redundancy and ensure that the notes of these functions do not get out of sync.
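
For context, a minimal sketch of why one shared set of notes can serve all four functions: percentile takes the level in percent rather than as a probability, and the nan-variants skip NaN entries. The sample values below are arbitrary and chosen only for illustration.

```python
import numpy as np

a = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
q = np.array([0.1, 0.5, 0.9])

# percentile takes the level in percent; quantile takes it in [0, 1].
same_as_percentile = np.allclose(np.percentile(a, 100 * q), np.quantile(a, q))

# The nan-variants ignore NaN entries and otherwise estimate the same thing.
b = np.append(a, np.nan)
same_as_nanquantile = np.allclose(np.nanquantile(b, q), np.quantile(a, q))
same_as_nanpercentile = np.allclose(np.nanpercentile(b, 100 * q), np.quantile(a, q))

print(same_as_percentile, same_as_nanquantile, same_as_nanpercentile)
```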

Closes gh-24588

@lorentzenchr

numpy/lib/_function_base_impl.py

mdhaber · 2024-02-02T17:03:37Z

Finally.

I guess it needed to be the simplified table markup and not include a reference link. @lorentzenchr @melissawm I think this is ready for a look when you have a moment. Link to the proposed quantile docs.

rossbar · 2024-02-02T18:44:16Z

> and not include a reference link.

Sorry for catching this late - for future reference some numpydoc features don't work inside tables - there is a warning to this effect buried deep in the docs

lorentzenchr

This PR makes the reading much easier.
Note that the implementation follows the old text with "virtual indices". Now, the code gets even harder to read. I don't know how to best improve that.

lorentzenchr · 2024-02-05T07:55:54Z

numpy/lib/_function_base_impl.py

-
-        * if g > 0 ; then take j
-        * if g = 0 ; then average between bounds
+    By default (``method='linear'``), this is done by interpolating between


I would start differently here. Something like: "There are many different ways to estimate quantiles. All the choices implemented here are specified by the ``method`` parameter and work as follows: One starts by making a sorted copy y of a..."

One of the recommendations in gh-24588 - and what I was inclined to do - was to start by addressing the default method before diving into the general framework. This lowers the initial complexity slightly, and I suspect that the majority of users are only interested in the default anyway. I've struck a compromise here: even though it starts as though it's only describing the default method, the transition to the general case is a very small step. I think this worked out nicely, so I'd prefer to keep it for now. We can leave the suggestion open so the maintainer who will review this (maybe @melissawm?) can see the suggestion.

Maybe something like:

Suggested change

By default (``method='linear'``), this is done by interpolating between

The optional method parameter specifies the method to use when the desired quantile is non-integer. By default (``method='linear'``), this is done by interpolating between

I use "non-integer" as a way to express that the result is somehow ambiguous and an arbitrary decision must be made. Feel free to reword it, but the gist of this suggestion is mentioning that this decision/ambiguity is the reason for the method parameter.

To address this, I've removed the mention of the method parameter until later. Now we start with the simple default, then (only for readers who are still interested) introduce the method parameter.

lorentzenchr · 2024-02-05T07:58:26Z

numpy/lib/_function_base_impl.py

-    interpolated_inverted_cdf:
-        method 4 of H&F [1]_.
-        This method gives continuous results using:
+    This a special case of Equation 1 of H&F [1]_. More generally,


Suggested change

This a special case of Equation 1 of H&F [1]_. More generally,

This is a special case of Equation 1 of H&F [1]_. More generally,

lorentzenchr · 2024-02-05T07:59:28Z

numpy/lib/_function_base_impl.py

-    midpoint:
-        NumPy method kept for backwards compatibility.
-        Uses ``(i + j) / 2``.
+    =============================== =============== ===============


It would be nice to have one table with all the methods listed in H&F, including numbers 1-3.

This is another thing for the reviewing maintainer to consider. I would agree if the relationship between all the methods were simpler. Unfortunately, methods 1-3 use three different definitions of g, which breaks the nice interpretation of j and g being the "integer and fractional components" of the same quantity. We could add a column to the table, but I left them out intentionally since these methods are quite different.

Again, the compromise would be to mention this somehow, as a note to the reader that this table is not complete. Maybe move one sentence up from a couple of paragraphs down and write something like:

For the continuous estimators defined in [1]_, the table below shows how ``m`` is computed as a function of probability `q`:

There are only three sentences that separate this table of continuous estimators from the discontinuous estimators below.

The first is a minor footnote about clipping the indices. If we cannot make this an actual footnote, the best place for it is immediately after the table. (Before the table would interrupt the explanation, and a Note admonition would only increase the separation between the table and list.)

The next two explain that the table includes only the continuous estimators and introduces the discontinuous estimators from that same paper. I've modified it slightly (below) to emphasize that the table does not include all the estimators from that paper.

lorentzenchr · 2024-02-05T08:01:26Z

numpy/lib/_function_base_impl.py

+    empirical distribution function of the data, i.e.
+    :math:`F(y) = \\frac{1}{n} \\sum_i 1_{a_i \\leq y}`.
+    Then, different methods correspond to different choices of :math:`x` that
+    fulfill the above inequalities.


Suggested change

fulfill the above inequalities.

fulfill the above coverage conditions.

lorentzenchr · 2024-02-05T08:01:48Z

numpy/lib/_function_base_impl.py

+    :math:`F(y) = \\frac{1}{n} \\sum_i 1_{a_i \\leq y}`.
+    Then, different methods correspond to different choices of :math:`x` that
+    fulfill the above inequalities.
+
    For weighted quantiles, the above coverage conditions still hold. The


Suggested change

For weighted quantiles, the above coverage conditions still hold. The

For weighted quantiles, the coverage conditions still hold. The

lorentzenchr · 2024-02-05T08:04:14Z

numpy/lib/_function_base_impl.py

+    Given a sample `a` from an underlying distribution, `quantile` provides a
+    nonparametric estimate of the inverse cumulative distribution function.
+
+    More formally, the quantile at probability level :math:`q` of a cumulative
    distribution function :math:`F(y)=P(Y \\leq y)` with probability measure
    :math:`P` is defined as any number :math:`x` that fulfills the
    *coverage conditions*


Note that most methods listed in H&F do not fulfil the coverage conditions.
In this restructuring, the line of argument that I introduced is a bit broken.
Still, I would very much like to keep the coverage conditions as they are very valuable for a reader and for quantiles!

Ok, I will go ahead and move this below so that the argument is not broken. Also, if most methods don't fulfill the coverage conditions, then it is probably best not to introduce them when we're defining what the function does.

I would definitely keep the coverage conditions. They are kind of the defining equations for a quantile in statistical terms. As estimations, sample quantiles are allowed to deviate. Note that P2 of HF is one of the coverage conditions.

Yes, I've kept them, and I've made your argument contiguous again.
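
To make the point about coverage concrete, here is a small sketch. It assumes the coverage conditions are the usual pair P(Y < x) <= q and P(Y <= x) >= q (the inequalities themselves are cut off in the hunk above), evaluated under the empirical distribution of the sample. The helper `coverage_holds` is a hypothetical name used only for this illustration.

```python
import numpy as np

a = np.arange(8.0)  # sample 0, 1, ..., 7 (n = 8, no ties)
q = 0.3

def coverage_holds(x, a, q):
    """Check P(Y < x) <= q <= P(Y <= x) under the empirical distribution of a."""
    return bool(np.mean(a < x) <= q <= np.mean(a <= x))

x_inv = np.quantile(a, q, method="inverted_cdf")
x_lin = np.quantile(a, q, method="linear")

print("inverted_cdf:", x_inv, coverage_holds(x_inv, a, q))
print("linear:      ", x_lin, coverage_holds(x_lin, a, q))
```

On this sample, `inverted_cdf` returns an observation (2.0) and passes the check, while `linear` returns roughly 2.1, for which 3 of the 8 observations lie strictly below, a fraction that exceeds q = 0.3. This is one instance of a sample quantile deviating from the defining conditions, as discussed above.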

mdhaber

Thanks for the review. I'll make changes after other comments have come in.

mdhaber · 2024-02-05T16:13:03Z

numpy/lib/_function_base_impl.py

+    Given a sample `a` from an underlying distribution, `quantile` provides a
+    nonparametric estimate of the inverse cumulative distribution function.
+
+    More formally, the quantile at probability level :math:`q` of a cumulative
    distribution function :math:`F(y)=P(Y \\leq y)` with probability measure
    :math:`P` is defined as any number :math:`x` that fulfills the
    *coverage conditions*


Ok, I will go ahead and move this below so that the argument is not broken. Also, if most methods don't fulfill the coverage conditions, then it is probably best not to introduce them when we're defining what the function does.

mdhaber · 2024-02-05T16:25:48Z

numpy/lib/_function_base_impl.py

-
-        * if g > 0 ; then take j
-        * if g = 0 ; then average between bounds
+    By default (``method='linear'``), this is done by interpolating between


One of the recommendations in gh-24588 - and what I was inclined to do - was to start by addressing the default method before diving into the general framework. This lowers the initial complexity slightly, and I suspect that the majority of users are only interested in the default anyway. I've struck a compromise here: even though it starts as though it's only describing the default method, the transition to the general case is a very small step. I think this worked out nicely, so I'd prefer to keep it for now. We can leave the suggestion open so the maintainer who will review this (maybe @melissawm?) can see the suggestion.

mdhaber · 2024-02-05T16:41:47Z

numpy/lib/_function_base_impl.py

-    midpoint:
-        NumPy method kept for backwards compatibility.
-        Uses ``(i + j) / 2``.
+    =============================== =============== ===============


This is another thing for the reviewing maintainer to consider. I would agree if the relationship between all the methods were simpler. Unfortunately, methods 1-3 use three different definitions of g, which breaks the nice interpretation of j and g being the "integer and fractional components" of the same quantity. We could add a column to the table, but I left them out intentionally since these methods are quite different.

mdhaber · 2024-02-05T16:45:58Z

numpy/lib/_function_base_impl.py

-    interpolated_inverted_cdf:
-        method 4 of H&F [1]_.
-        This method gives continuous results using:
+    This a special case of Equation 1 of H&F [1]_. More generally,


melissawm

I'm sorry I dropped the ball on this! I left a couple of comments but please keep in mind I'm no expert in this area.

One thing that stood out to me reading the full docstring is the mention of "R types" in the method parameter. A quick scan of the H&F paper didn't help me - I tried tracking this change in the docs but didn't really manage to find why this comment was added (it was added in 1.22.0). If you know what this means, is there a way we can explain this comment? Otherwise it can be confusing for readers.

melissawm · 2024-04-24T18:31:45Z

numpy/lib/_function_base_impl.py

-
-        * if g > 0 ; then take j
-        * if g = 0 ; then average between bounds
+    By default (``method='linear'``), this is done by interpolating between


Maybe something like:

Suggested change

By default (``method='linear'``), this is done by interpolating between

The optional method parameter specifies the method to use when the desired quantile is non-integer. By default (``method='linear'``), this is done by interpolating between

I use "non-integer" as a way to express that the result is somehow ambiguous and an arbitrary decision must be made. Feel free to reword it, but the gist of this suggestion is mentioning that this decision/ambiguity is the reason for the method parameter.

melissawm · 2024-04-24T18:40:30Z

numpy/lib/_function_base_impl.py

-    midpoint:
-        NumPy method kept for backwards compatibility.
-        Uses ``(i + j) / 2``.
+    =============================== =============== ===============


Again, the compromise would be to mention this somehow, as a note to the reader that this table is not complete. Maybe move one sentence up from a couple of paragraphs down and write something like:

For the continuous estimators defined in [1]_, the table below shows how ``m`` is computed as a function of probability `q`:

melissawm · 2024-04-24T18:44:46Z

Overall, much much better! Thank you so much @mdhaber and @lorentzenchr for the review!

mdhaber · 2024-04-24T19:15:01Z

numpy/lib/_function_base_impl.py

-    midpoint:
-        NumPy method kept for backwards compatibility.
-        Uses ``(i + j) / 2``.
+    =============================== =============== ===============


There are only three sentences that separate this table of continuous estimators from the discontinuous estimators below.

The first is a minor footnote about clipping the indices. If we cannot make this an actual footnote, the best place for it is immediately after the table. (Before the table would interrupt the explanation, and a Note admonition would only increase the separation between the table and list.)

The next two explain that the table includes only the continuous estimators and introduces the discontinuous estimators from that same paper. I've modified it slightly (below) to emphasize that the table does not include all the estimators from that paper.

mdhaber · 2024-04-24T19:19:27Z

numpy/lib/_function_base_impl.py

+    The methods above are all continuous functions of probability `q`.
+    Methods 1-3 of H&F [1]_ provide three discontinuous estimators, where
+    ``j`` is defined as above and ``m`` and ``g`` are defined as follows.


Suggested change

The methods above are all continuous functions of probability `q`.

Methods 1-3 of H&F [1]_ provide three discontinuous estimators, where

``j`` is defined as above and ``m`` and ``g`` are defined as follows.

The table above includes only the estimators from H&F that are continuous

functions of probability `q` (estimators 4-9). NumPy also provides the

three discontinuous estimators from H&F (estimators 1-3), where ``j`` is

defined as above and ``m`` and ``g`` are defined as follows.
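
A rough way to see the continuous/discontinuous split numerically is to evaluate each method on a fine grid of `q` and look at the largest change between neighbouring grid points. This is only a sketch: the sample and the grid spacing are arbitrary, and `largest_jump` is a hypothetical helper, not part of NumPy.

```python
import numpy as np

a = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
qs = np.linspace(0, 1, 101)

discontinuous = ["inverted_cdf", "averaged_inverted_cdf", "closest_observation"]
continuous = ["interpolated_inverted_cdf", "hazen", "weibull", "linear",
              "median_unbiased", "normal_unbiased"]

def largest_jump(method):
    """Largest change in the estimate between neighbouring grid points of q."""
    return float(np.max(np.abs(np.diff(np.quantile(a, qs, method=method)))))

jumps = {name: largest_jump(name) for name in discontinuous + continuous}
for name, jump in jumps.items():
    print(f"{name:28s} {jump:.3f}")
```

For estimators 1-3 the largest change stays on the order of the gaps between observations no matter how fine the grid is, whereas for estimators 4-9 it shrinks as the grid is refined.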

mdhaber · 2024-04-24T19:28:38Z

numpy/lib/_function_base_impl.py

+    where ``m`` is defined differently for each value of the ``method``
+    parameter.


Suggested change

where ``m`` is defined differently for each value of the ``method``

parameter.

where ``m`` may be defined according to several different conventions.

The preferred convention may be selected using the ``method`` parameter:

mdhaber · 2024-04-24T19:30:56Z

numpy/lib/_function_base_impl.py

-
-        * if g > 0 ; then take j
-        * if g = 0 ; then average between bounds
+    By default (``method='linear'``), this is done by interpolating between


To address this, I've removed the mention of the method parameter until later. Now we start with the simple default, then (only for readers who are still interested) introduce the method parameter.

mdhaber · 2024-04-24T19:35:09Z

numpy/lib/_function_base_impl.py

+    Given a sample `a` from an underlying distribution, `quantile` provides a
+    nonparametric estimate of the inverse cumulative distribution function.
+
+    More formally, the quantile at probability level :math:`q` of a cumulative
    distribution function :math:`F(y)=P(Y \\leq y)` with probability measure
    :math:`P` is defined as any number :math:`x` that fulfills the
    *coverage conditions*


Yes, I've kept them, and I've made your argument contiguous again.

mdhaber · 2024-04-24T19:45:07Z

Thanks @melissawm. I think the "R types" refer to the "Types" listed in the documentation of R's quantile function. I just removed this because it is confusing, and anyone already familiar with the R types will recognize these as being identical because the reference and numbering scheme are identical. I've also implemented changes to respond to the other comments. Hopefully this can be backported to NumPy 2.0? It would help to be able to refer to this when discussing the addition of quantile to the array API (data-apis/array-api#795).

melissawm

This looks great to me, thank you @mdhaber !

melissawm · 2024-05-03T19:56:58Z

I'll leave it open in case @lorentzenchr wants to take a last look. Cheers!

mdhaber · 2024-05-03T20:05:17Z

Since this has been close to its present state for a few months, I'd ask whether we can merge as-is, and if there are additional changes to be made, they can be made in a new PR by the one who wants to make them?

lorentzenchr · 2024-05-03T20:56:25Z

I'll have a quick look...

lorentzenchr

Minor comments. Would be good without them.
I really like this doc improvement. It reads so much better!

I have no rights to approve, but consider it a LGTM.

lorentzenchr · 2024-05-03T21:02:04Z

numpy/lib/_function_base_impl.py

-    distribution function :math:`F(y)=P(Y \\leq y)` with probability measure
-    :math:`P` is defined as any number :math:`x` that fulfills the
-    *coverage conditions*
+    Given a sample `a` from an underlying distribution, `quantile` provides a


Sometimes it's `a`, sometimes it's ``a``, and they render differently.

According to numpydoc standards, single backticks are correct here because a is an argument of this function. There were only single backticks in the original docstring, and I've used single backticks consistently here. However, I see that some of each were added in gh-24254, and there's one double-backtick left that I didn't notice, since I wrote this before that PR merged. I added #25704 (comment) to fix this.

For more information, see numpy/numpydoc#52.

lorentzenchr · 2024-05-03T21:02:54Z

numpy/lib/_function_base_impl.py

-
-    .. math::
-        i + g = q * ( n - alpha - beta + 1 ) + alpha
+    By default, this is done by interpolating between adjacent elements in


Suggested change

By default, this is done by interpolating between adjacent elements in

By default (`method="linear"`), this is done by interpolating between adjacent elements in

I had this, but removed it to respond to another comment.

lorentzenchr · 2024-05-03T21:20:46Z

numpy/lib/_function_base_impl.py

-        method 2 of H&F [1]_.
-        This method gives discontinuous results:
+    - ``j = (q*n + m - 1) // 1``, and
+    - ``g = (q*n + m - 1) % 1``,


Maybe add here that the -1 is for translating from 1-based (H&F) to 0-based (Python) indexing.

Yes, this is the place I mentioned here where the -1 can be attributed to the 0-based indexing. Without significant rework, I think adding this information would interrupt the important part about m, but here is a possible way to include this information.
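
For readers following along, the two formulas can be sketched in plain Python for the continuous methods. This is an illustration only, not NumPy's implementation: `quantile_sketch` and `M_OF_Q` are hypothetical names, the values of `m` are my reading of the table in the proposed docstring (they do not appear in this thread), and only scalar `q` is handled. The clipping of `j` and `j + 1` follows the footnote discussed earlier.

```python
import numpy as np

# m as a function of q for the continuous methods (H&F estimators 4-9).
# Assumption: values as I understand the table in the proposed docstring.
M_OF_Q = {
    "interpolated_inverted_cdf": lambda q: 0.0,
    "hazen": lambda q: 0.5,
    "weibull": lambda q: q,
    "linear": lambda q: 1 - q,
    "median_unbiased": lambda q: q / 3 + 1 / 3,
    "normal_unbiased": lambda q: q / 4 + 3 / 8,
}

def quantile_sketch(a, q, method="linear"):
    """Illustrative re-derivation for scalar q; not NumPy's implementation."""
    y = np.sort(np.asarray(a, dtype=float))
    n = y.size
    # The "- 1" converts H&F's 1-based position into a 0-based index.
    virtual = q * n + M_OF_Q[method](q) - 1
    j = int(virtual // 1)  # integer part
    g = virtual % 1        # fractional part
    lo = min(max(j, 0), n - 1)      # clip j to [0, n - 1]
    hi = min(max(j + 1, 0), n - 1)  # clip j + 1 to [0, n - 1]
    return (1 - g) * y[lo] + g * y[hi]

sample = [7.0, 1.0, 3.0, 10.0, 4.0]
print(quantile_sketch(sample, 0.5), np.quantile(sample, 0.5))
```

With `method="linear"`, `m = 1 - q` makes the 0-based position `q*n + (1 - q) - 1 = q*(n - 1)`, which is the familiar interpolation between adjacent sorted elements.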

lorentzenchr · 2024-05-03T21:26:03Z

numpy/lib/_function_base_impl.py


-        * alpha = 0
-        * beta = 1
+    1. ``inverted_cdf``: ``m = 0`` and ``g = int(q*n > 0)``


In my experience, this is the version used the most in statistical literature and corresponds to H&F's first equation in the introduction. But this equation is not general enough: it neglects distributions with flat regions, e.g. step functions.

Summary: better no action.

I don't understand. Does "better no action" mean that there is no action to take to respond to this comment, or that I should not have taken action to adjust this part of the documentation?

Yes, I suggest leaving it as is. Just wanted to share thoughts with a fellow statistician.

mdhaber · 2024-05-03T22:39:44Z

numpy/lib/_function_base_impl.py


-        * if g > 0 ; then take j
-        * if g = 0 ; then average between bounds
+    where ``m`` may be defined according to several different conventions.


To respond to this,

Suggested change

where ``m`` may be defined according to several different conventions.

where the ``- 1`` in these formulas accounts for Python's 0-based indexing,

and ``m`` may be defined according to several different conventions.

I suppose I left this out originally because I chose not to explain all of the ways in which the formulas are not written as they are in the paper. For instance, I also use % to take the fractional part rather than subtracting the integral part. This seems equally notable to me as an adaptation from math to code. While adding such information would be somewhat helpful to those who want to compare this to the original text, I think it makes the text harder to follow for those who are reading only this document. I would prefer not to add this.

// and % are clear and don’t need explanation.
The different base of indexing is in my opinion worth mentioning somewhere. It took me a while to figure it out.
Maybe a good place is below the reference.

https://github.com/numpy/numpy/pull/25704/files#r1590013796

lorentzenchr · 2024-05-05T09:58:25Z

@seberg Why did you close?

seberg · 2024-05-05T10:57:51Z

> @seberg Why did you close?

Ouch, I honestly don't know how that happened!

mdhaber · 2024-05-05T18:36:07Z

I see the Windows CI failure in another PR (gh-26386); seems unrelated.
After fixing the lint issue I squashed to clean up the commit message (since not everyone squash-merges), thinking that the "compare" link would let you see the changes easily, but it doesn't. Besides fixing the lint issue, I only tweaked "for" to "of" in the description of the q argument and moved the bit about "Sample quantiles, the result..." from the top, where it was redundant, to after "More formally..." where it belonged. That made all that text contiguous, so that most of its commit credit is preserved. The backup branch is here in case you want to check out the diff.

seberg · 2024-05-06T11:57:57Z

Had a quick read through and it looks good, and I think things were settling. Thanks everyone, let's get this in!

mdhaber requested a review from melissawm January 27, 2024 10:51

github-actions bot added the 04 - Documentation label Jan 27, 2024

eendebakpt reviewed Jan 27, 2024

mdhaber commented Jan 28, 2024

mdhaber commented Jan 28, 2024

mdhaber commented Jan 28, 2024

mdhaber commented Jan 28, 2024

mdhaber commented Jan 28, 2024

mdhaber commented Jan 30, 2024

lorentzenchr reviewed Feb 5, 2024

mdhaber commented Feb 5, 2024

ngoldbaum added the triage review label Feb 17, 2024

melissawm reviewed Apr 24, 2024

mdhaber commented Apr 24, 2024

mdhaber requested a review from melissawm April 29, 2024 05:52

melissawm approved these changes May 3, 2024

lorentzenchr reviewed May 3, 2024

mdhaber commented May 3, 2024

mdhaber commented May 4, 2024

mdhaber commented May 4, 2024

mdhaber requested a review from melissawm May 4, 2024 16:36

seberg closed this May 5, 2024

lorentzenchr reviewed May 5, 2024

seberg reopened this May 5, 2024

DOC: quantile: correct/simplify documentation

52ddb74

mdhaber force-pushed the quantile_doc branch from a9614eb to 52ddb74 Compare May 5, 2024 18:21

seberg merged commit f9e8fd5 into numpy:main May 6, 2024
