Skip to content

DOC: quantile: correct/simplify documentation #25704

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 6, 2024

Conversation

mdhaber
Copy link
Contributor

@mdhaber mdhaber commented Jan 27, 2024

gh-24588 noted inaccuracies in the quantile documentation. This PR corrects these and simplifies the descriptions of the various methods. It also refers users of percentile, nanquantile, nanpercentile to the documentation of quantile to reduce redunandancy and ensure that the notes of these functions do not get out of sync.

Closes gh-24588

@lorentzenchr

@mdhaber
Copy link
Contributor Author

mdhaber commented Feb 2, 2024

Finally.

image

I guess it needed to be the simplified table markup and not include a reference link. @lorentzenchr @melissawm I think this is ready for a look when you have a moment. Link to the proposed quantile docs.

@rossbar
Copy link
Contributor

rossbar commented Feb 2, 2024

and not include a reference link.

Sorry for catching this late - for future reference some numpydoc features don't work inside tables - there is a warning to this effect buried deep in the docs

Copy link
Contributor

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR makes the reading much easier.
Note that the implementation follows the old text with "virtual indices". Now, the code gets even harder to read. I don't know how to best improve that.


* if g > 0 ; then take j
* if g = 0 ; then average between bounds
By default (``method='linear'``), this is done by interpolating between
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would start differently here. Something like: "There are many different ways to estimate quantiles. All the choices implemented here are specified by the methods parameter and work as follows: On starts by making a sorted copy y or a..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the recommendations in gh-24588 - and what I was inclined to do - was to start by addressing the default method before diving into the general framework. This lowers the initial complexity slightly, and I suspect that the majority of users are only interested in the default anyway. I've struck a compromise here: even though it starts as though it's only describing the default method, the transition to the general case is a very small step. I think this worked out nicely, so I'd prefer to keep it for now. We can leave the suggestion open so the maintainer who will review this (maybe @melissawm?) can see the suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like:

Suggested change
By default (``method='linear'``), this is done by interpolating between
The optional method parameter specifies the method to use when the desired quantile is non-integer. By default (``method='linear'``), this is done by interpolating between

I use "non-integer" as a way to express that the result is somehow ambiguous and an arbitrary decision must be made. Feel free to reword it, but the gist of this suggestion is mentioning that this decision/ambiguity is the reason for the method parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To address this, I've removed the mention of the method parameter until later. Now we start with the simple default, then (only for readers who are still interested) introduce the method parameter.

interpolated_inverted_cdf:
method 4 of H&F [1]_.
This method gives continuous results using:
This a special case of Equation 1 of H&F [1]_. More generally,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This a special case of Equation 1 of H&F [1]_. More generally,
This is a special case of Equation 1 of H&F [1]_. More generally,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

midpoint:
NumPy method kept for backwards compatibility.
Uses ``(i + j) / 2``.
=============================== =============== ===============
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to one table with all the methods listed in HF, including number 1-3.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another thing for the reviewing maintainer to consider. I would agree if the relationship between all the methods were simpler. Unfortunately, methods 1-3 use three different definitions of g, which breaks the nice interpretation of j and g being the "integer and fractional components" of the same quantity. We could add a column to the table, but I left them out intentionally since these methods are quite different.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the compromise would be to mention this somehow, as a note to the reader that this table is not complete. Maybe something like move one sentence from a couple paragraphs down and write:

For the continuous estimators defined in [1]_, the table below shows how ``m`` is computed as a function of probability `q`:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only three sentences that separate this table of continuous estimators from the discontinuous estimators below.

The first is a minor footnote about clipping the indices. If we cannot make this an actual footnote, the best place for it is immediately after the table. (Before the table would interrupt the explanation, and a Note admonition would only increase the separation between the table and list.)

The next two explain that the table includes only the continuous estimators and introduces the discontinuous estimators from that same paper. I've modified it slightly (below) to emphasize that the table does not include all the estimators from that paper.

empirical distribution function of the data, i.e.
:math:`F(y) = \\frac{1}{n} \\sum_i 1_{a_i \\leq y}`.
Then, different methods correspond to different choices of :math:`x` that
fulfill the above inequalities.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fulfill the above inequalities.
fulfill the above coverage conditions.

:math:`F(y) = \\frac{1}{n} \\sum_i 1_{a_i \\leq y}`.
Then, different methods correspond to different choices of :math:`x` that
fulfill the above inequalities.

For weighted quantiles, the above coverage conditions still hold. The
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For weighted quantiles, the above coverage conditions still hold. The
For weighted quantiles, the coverage conditions still hold. The

Given a sample `a` from an underlying distribution, `quantile` provides a
nonparametric estimate of the inverse cumulative distribution function.

More formally, the quantile at probability level :math:`q` of a cumulative
distribution function :math:`F(y)=P(Y \\leq y)` with probability measure
:math:`P` is defined as any number :math:`x` that fulfills the
*coverage conditions*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that most methods listed in HF do not fulfil the coverage conditions.
In this restructuring, the line of argument that I introduces is a bit broken.
Still, I would very much like to keep the coverage conditions as they are very valuable for a reader and for quantiles!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will go ahead and move this below so that the argument is not broken. Also, if most methods don't fulfill the coverage conditions, then it is probably best not to introduce them when we're defining what the function does.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would definitely keep the coverage conditions. They are kind of the defining equations for a quantile in statistical terms. As estimations, sample quantiles are allowed to deviate. Note that P2 of HF is one of the coverage conditions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've kept them, and I've made your argument contiguous again.

Copy link
Contributor Author

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. I'll make changes after other comments have come in.

Given a sample `a` from an underlying distribution, `quantile` provides a
nonparametric estimate of the inverse cumulative distribution function.

More formally, the quantile at probability level :math:`q` of a cumulative
distribution function :math:`F(y)=P(Y \\leq y)` with probability measure
:math:`P` is defined as any number :math:`x` that fulfills the
*coverage conditions*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will go ahead and move this below so that the argument is not broken. Also, if most methods don't fulfill the coverage conditions, then it is probably best not to introduce them when we're defining what the function does.


* if g > 0 ; then take j
* if g = 0 ; then average between bounds
By default (``method='linear'``), this is done by interpolating between
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the recommendations in gh-24588 - and what I was inclined to do - was to start by addressing the default method before diving into the general framework. This lowers the initial complexity slightly, and I suspect that the majority of users are only interested in the default anyway. I've struck a compromise here: even though it starts as though it's only describing the default method, the transition to the general case is a very small step. I think this worked out nicely, so I'd prefer to keep it for now. We can leave the suggestion open so the maintainer who will review this (maybe @melissawm?) can see the suggestion.

midpoint:
NumPy method kept for backwards compatibility.
Uses ``(i + j) / 2``.
=============================== =============== ===============
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another thing for the reviewing maintainer to consider. I would agree if the relationship between all the methods were simpler. Unfortunately, methods 1-3 use three different definitions of g, which breaks the nice interpretation of j and g being the "integer and fractional components" of the same quantity. We could add a column to the table, but I left them out intentionally since these methods are quite different.

interpolated_inverted_cdf:
method 4 of H&F [1]_.
This method gives continuous results using:
This a special case of Equation 1 of H&F [1]_. More generally,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@ngoldbaum ngoldbaum added the triage review Issue/PR to be discussed at the next triage meeting label Feb 17, 2024
Copy link
Member

@melissawm melissawm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry I dropped the ball on this! I left a couple of comments but please keep in mind I'm no expert in this area.

One thing that stood out to me reading the full docstring is the mention of "R types" in the method parameter. A quick scan of the H&F paper didn't help me - I tried tracking this change in the docs but didn't really manage to find why this comment was added (it was added in 1.22.0). If you know what this means, is there a way we can explain this comment? Otherwise it can be confusing for readers.


* if g > 0 ; then take j
* if g = 0 ; then average between bounds
By default (``method='linear'``), this is done by interpolating between
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like:

Suggested change
By default (``method='linear'``), this is done by interpolating between
The optional method parameter specifies the method to use when the desired quantile is non-integer. By default (``method='linear'``), this is done by interpolating between

I use "non-integer" as a way to express that the result is somehow ambiguous and an arbitrary decision must be made. Feel free to reword it, but the gist of this suggestion is mentioning that this decision/ambiguity is the reason for the method parameter.

midpoint:
NumPy method kept for backwards compatibility.
Uses ``(i + j) / 2``.
=============================== =============== ===============
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the compromise would be to mention this somehow, as a note to the reader that this table is not complete. Maybe something like move one sentence from a couple paragraphs down and write:

For the continuous estimators defined in [1]_, the table below shows how ``m`` is computed as a function of probability `q`:

@melissawm
Copy link
Member

Overall, much much better! Thank you so much @mdhaber and @lorentzenchr for the review!

midpoint:
NumPy method kept for backwards compatibility.
Uses ``(i + j) / 2``.
=============================== =============== ===============
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only three sentences that separate this table of continuous estimators from the discontinuous estimators below.

The first is a minor footnote about clipping the indices. If we cannot make this an actual footnote, the best place for it is immediately after the table. (Before the table would interrupt the explanation, and a Note admonition would only increase the separation between the table and list.)

The next two explain that the table includes only the continuous estimators and introduces the discontinuous estimators from that same paper. I've modified it slightly (below) to emphasize that the table does not include all the estimators from that paper.

Comment on lines 4332 to 4334
The methods above are all continuous functions of probability `q`.
Methods 1-3 of H&F [1]_ provide three discontinuous estimators, where
``j`` is defined as above and ``m`` and ``g`` are defined as follows.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The methods above are all continuous functions of probability `q`.
Methods 1-3 of H&F [1]_ provide three discontinuous estimators, where
``j`` is defined as above and ``m`` and ``g`` are defined as follows.
The table above includes only the estimators from H&F that are continuous
functions of probability `q` (estimators 4-9). NumPy also provides the
three discontinuous estimators from H&F (estimators 1-3), where ``j`` is
defined as above and ``m`` and ``g`` are defined as follows.

Comment on lines 4314 to 4315
where ``m`` is defined differently for each value of the ``method``
parameter.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
where ``m`` is defined differently for each value of the ``method``
parameter.
where ``m`` may be defined according to several different conventions.
The preferred convention may be selected using the ``method`` parameter:


* if g > 0 ; then take j
* if g = 0 ; then average between bounds
By default (``method='linear'``), this is done by interpolating between
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To address this, I've removed the mention of the method parameter until later. Now we start with the simple default, then (only for readers who are still interested) introduce the method parameter.

Given a sample `a` from an underlying distribution, `quantile` provides a
nonparametric estimate of the inverse cumulative distribution function.

More formally, the quantile at probability level :math:`q` of a cumulative
distribution function :math:`F(y)=P(Y \\leq y)` with probability measure
:math:`P` is defined as any number :math:`x` that fulfills the
*coverage conditions*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've kept them, and I've made your argument contiguous again.

@mdhaber
Copy link
Contributor Author

mdhaber commented Apr 24, 2024

Thanks @melissawm. I think the "R types" refer to the "Types" listed in the documentation of R's quantile function. I just removed this because it is confusing, and anyone already familiar with the R types will recognize these as being identical because the reference and numbering scheme is identical. I've also implemented changes to respond to the other comments. Hopefully this can be backported to NumPy 2.0? It would help to be able to refer to this when discussing the addition of quantile to the array API (data-apis/array-api#795).

@mdhaber mdhaber requested a review from melissawm April 29, 2024 05:52
Copy link
Member

@melissawm melissawm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me, thank you @mdhaber !

@melissawm
Copy link
Member

I'll leave it open in case @lorentzenchr wants to take a last look. Cheers!

@mdhaber
Copy link
Contributor Author

mdhaber commented May 3, 2024

Since this has been close to its present state for a few months, I'd ask whether we can merge as-is, and if there are additional changes to be made, they can be made in a new PR by the one who wants to make them?

@lorentzenchr
Copy link
Contributor

I'll have a quick look...

Copy link
Contributor

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments. Would be good without them.
I really like this doc improvement. It reads so much better!

I have no rights to approve, but consider it a LGTM.

distribution function :math:`F(y)=P(Y \\leq y)` with probability measure
:math:`P` is defined as any number :math:`x` that fulfills the
*coverage conditions*
Given a sample `a` from an underlying distribution, `quantile` provides a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes it's `a` sometimes its ``a`` and it renders differently.

Copy link
Contributor Author

@mdhaber mdhaber May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to numpydoc standards , single backticks are correct here because a is an argument of this function. There were only single-backticks in the original docstring, and I've used single backticks consistently here. However, I see that some of each were added in gh-24254, and there's one double-backtick left that I didn't notice, since I wrote this before that PR merged. I added #25704 (comment) to fix this.

For more information, see numpy/numpydoc#52.


.. math::
i + g = q * ( n - alpha - beta + 1 ) + alpha
By default, this is done by interpolating between adjacent elements in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
By default, this is done by interpolating between adjacent elements in
By default (`method="linear"`), this is done by interpolating between adjacent elements in

Copy link
Contributor Author

@mdhaber mdhaber May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had this, but removed it to respond to another comment.

method 2 of H&F [1]_.
This method gives discontinuous results:
- ``j = (q*n + m - 1) // 1``, and
- ``g = (q*n + m - 1) % 1``,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add here, that the -1 is for translating from 1-based (H&F) to 0-based (Python) indexing.

Copy link
Contributor Author

@mdhaber mdhaber May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the place I mentioned here where the -1 can be attributed to the 0-based indexing. Without significant rework, I think adding this information would interrupt the important part about m, but here is a possible way to include this information.


* alpha = 0
* beta = 1
1. ``inverted_cdf``: ``m = 0`` and ``g = int(q*n > 0)``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience, this is the version used the most in statistical literature and corresponds to H&F's first equation in the introduction. But this equation is not general enough, it neglects distributions with flat regions, e.g. step functions.

Summary: better no action.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. Does "better no action" mean that there is no action to take to respond to this comment, or that I should not have taken action to adjust this part of the documentation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I suggest to leave it as is. Just wanted to share thoughts with a fellow statistian.


* if g > 0 ; then take j
* if g = 0 ; then average between bounds
where ``m`` may be defined according to several different conventions.
Copy link
Contributor Author

@mdhaber mdhaber May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To respond to this,

Suggested change
where ``m`` may be defined according to several different conventions.
where the ``- 1`` in these formulas accounts for Python's 0-based indexing,
and ``m`` may be defined according to several different conventions.

I suppose I left this out original because I chose not to explain all of the ways in which the formulas are not written as they are in the paper. For instance, I also use % to take the fractional part rather than subtracting the integral part. This seems equally notable to me as an adaptation from math to code. While adding such information would be somewhat helpful those who want to compare this to the original text, I think they make it harder to follow for those who are reading only this document. I would prefer not to add this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// and % are clear and don’t need explanation.
The different base of indexing is in my opinion worth mentioning somewhere. It took me a while to figure it out.
Maybe a good place is below the reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdhaber mdhaber requested a review from melissawm May 4, 2024 16:36
@seberg seberg closed this May 5, 2024
@lorentzenchr
Copy link
Contributor

@seberg Why did you close?

@seberg
Copy link
Member

seberg commented May 5, 2024

@seberg Why did you close?

Ouch, I honstly don't know how that happened!

@seberg seberg reopened this May 5, 2024
@mdhaber
Copy link
Contributor Author

mdhaber commented May 5, 2024

I see the Windows CI failure in another PR (gh-26386); seems unrelated.
After fixing the lint issue I squashed to clean up the commit message (since not everyone squash-merges), thinking that the "compare" link would let you see the changes easily, but it doesn't. Besides fixing the lint issue, I only tweaked "for" to "of" in the description of the q argument and moved the bit about "Sample quantiles, the result..." from the top, where it was redundant, to after "More formally..." where it belonged. That made all that text contiguous, so that most of its commit credit is preserved. The backup branch is here in case you want to check out the diff.

@seberg
Copy link
Member

seberg commented May 6, 2024

Had a quick read through and it looks good and it I think things were settling. Thanks everyone, let's get this in!

@seberg seberg merged commit f9e8fd5 into numpy:main May 6, 2024
63 of 65 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
04 - Documentation triage review Issue/PR to be discussed at the next triage meeting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC: interpretation of terms in quantile documentation
7 participants