-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
DOC: quantile: correct/simplify documentation #25704
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Finally. ![]() I guess it needed to be the simplified table markup and not include a reference link. @lorentzenchr @melissawm I think this is ready for a look when you have a moment. Link to the proposed |
Sorry for catching this late - for future reference some numpydoc features don't work inside tables - there is a warning to this effect buried deep in the docs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR makes the reading much easier.
Note that the implementation follows the old text with "virtual indices". Now, the code gets even harder to read. I don't know how to best improve that.
numpy/lib/_function_base_impl.py
Outdated
|
||
* if g > 0 ; then take j | ||
* if g = 0 ; then average between bounds | ||
By default (``method='linear'``), this is done by interpolating between |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would start differently here. Something like: "There are many different ways to estimate quantiles. All the choices implemented here are specified by the methods
parameter and work as follows: On starts by making a sorted copy y or a..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the recommendations in gh-24588 - and what I was inclined to do - was to start by addressing the default method before diving into the general framework. This lowers the initial complexity slightly, and I suspect that the majority of users are only interested in the default anyway. I've struck a compromise here: even though it starts as though it's only describing the default method, the transition to the general case is a very small step. I think this worked out nicely, so I'd prefer to keep it for now. We can leave the suggestion open so the maintainer who will review this (maybe @melissawm?) can see the suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like:
By default (``method='linear'``), this is done by interpolating between | |
The optional method parameter specifies the method to use when the desired quantile is non-integer. By default (``method='linear'``), this is done by interpolating between |
I use "non-integer" as a way to express that the result is somehow ambiguous and an arbitrary decision must be made. Feel free to reword it, but the gist of this suggestion is mentioning that this decision/ambiguity is the reason for the method
parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To address this, I've removed the mention of the method
parameter until later. Now we start with the simple default, then (only for readers who are still interested) introduce the method
parameter.
numpy/lib/_function_base_impl.py
Outdated
interpolated_inverted_cdf: | ||
method 4 of H&F [1]_. | ||
This method gives continuous results using: | ||
This a special case of Equation 1 of H&F [1]_. More generally, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This a special case of Equation 1 of H&F [1]_. More generally, | |
This is a special case of Equation 1 of H&F [1]_. More generally, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
midpoint: | ||
NumPy method kept for backwards compatibility. | ||
Uses ``(i + j) / 2``. | ||
=============================== =============== =============== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to one table with all the methods listed in HF, including number 1-3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another thing for the reviewing maintainer to consider. I would agree if the relationship between all the methods were simpler. Unfortunately, methods 1-3 use three different definitions of g
, which breaks the nice interpretation of j
and g
being the "integer and fractional components" of the same quantity. We could add a column to the table, but I left them out intentionally since these methods are quite different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, the compromise would be to mention this somehow, as a note to the reader that this table is not complete. Maybe something like move one sentence from a couple paragraphs down and write:
For the continuous estimators defined in [1]_, the table below shows how ``m`` is computed as a function of probability `q`:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are only three sentences that separate this table of continuous estimators from the discontinuous estimators below.
The first is a minor footnote about clipping the indices. If we cannot make this an actual footnote, the best place for it is immediately after the table. (Before the table would interrupt the explanation, and a Note admonition would only increase the separation between the table and list.)
The next two explain that the table includes only the continuous estimators and introduces the discontinuous estimators from that same paper. I've modified it slightly (below) to emphasize that the table does not include all the estimators from that paper.
numpy/lib/_function_base_impl.py
Outdated
empirical distribution function of the data, i.e. | ||
:math:`F(y) = \\frac{1}{n} \\sum_i 1_{a_i \\leq y}`. | ||
Then, different methods correspond to different choices of :math:`x` that | ||
fulfill the above inequalities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fulfill the above inequalities. | |
fulfill the above coverage conditions. |
numpy/lib/_function_base_impl.py
Outdated
:math:`F(y) = \\frac{1}{n} \\sum_i 1_{a_i \\leq y}`. | ||
Then, different methods correspond to different choices of :math:`x` that | ||
fulfill the above inequalities. | ||
|
||
For weighted quantiles, the above coverage conditions still hold. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For weighted quantiles, the above coverage conditions still hold. The | |
For weighted quantiles, the coverage conditions still hold. The |
numpy/lib/_function_base_impl.py
Outdated
Given a sample `a` from an underlying distribution, `quantile` provides a | ||
nonparametric estimate of the inverse cumulative distribution function. | ||
|
||
More formally, the quantile at probability level :math:`q` of a cumulative | ||
distribution function :math:`F(y)=P(Y \\leq y)` with probability measure | ||
:math:`P` is defined as any number :math:`x` that fulfills the | ||
*coverage conditions* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that most methods listed in HF do not fulfil the coverage conditions.
In this restructuring, the line of argument that I introduces is a bit broken.
Still, I would very much like to keep the coverage conditions as they are very valuable for a reader and for quantiles!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will go ahead and move this below so that the argument is not broken. Also, if most methods don't fulfill the coverage conditions, then it is probably best not to introduce them when we're defining what the function does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would definitely keep the coverage conditions. They are kind of the defining equations for a quantile in statistical terms. As estimations, sample quantiles are allowed to deviate. Note that P2 of HF is one of the coverage conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've kept them, and I've made your argument contiguous again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. I'll make changes after other comments have come in.
numpy/lib/_function_base_impl.py
Outdated
Given a sample `a` from an underlying distribution, `quantile` provides a | ||
nonparametric estimate of the inverse cumulative distribution function. | ||
|
||
More formally, the quantile at probability level :math:`q` of a cumulative | ||
distribution function :math:`F(y)=P(Y \\leq y)` with probability measure | ||
:math:`P` is defined as any number :math:`x` that fulfills the | ||
*coverage conditions* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will go ahead and move this below so that the argument is not broken. Also, if most methods don't fulfill the coverage conditions, then it is probably best not to introduce them when we're defining what the function does.
numpy/lib/_function_base_impl.py
Outdated
|
||
* if g > 0 ; then take j | ||
* if g = 0 ; then average between bounds | ||
By default (``method='linear'``), this is done by interpolating between |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the recommendations in gh-24588 - and what I was inclined to do - was to start by addressing the default method before diving into the general framework. This lowers the initial complexity slightly, and I suspect that the majority of users are only interested in the default anyway. I've struck a compromise here: even though it starts as though it's only describing the default method, the transition to the general case is a very small step. I think this worked out nicely, so I'd prefer to keep it for now. We can leave the suggestion open so the maintainer who will review this (maybe @melissawm?) can see the suggestion.
midpoint: | ||
NumPy method kept for backwards compatibility. | ||
Uses ``(i + j) / 2``. | ||
=============================== =============== =============== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another thing for the reviewing maintainer to consider. I would agree if the relationship between all the methods were simpler. Unfortunately, methods 1-3 use three different definitions of g
, which breaks the nice interpretation of j
and g
being the "integer and fractional components" of the same quantity. We could add a column to the table, but I left them out intentionally since these methods are quite different.
numpy/lib/_function_base_impl.py
Outdated
interpolated_inverted_cdf: | ||
method 4 of H&F [1]_. | ||
This method gives continuous results using: | ||
This a special case of Equation 1 of H&F [1]_. More generally, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry I dropped the ball on this! I left a couple of comments but please keep in mind I'm no expert in this area.
One thing that stood out to me reading the full docstring is the mention of "R types" in the method parameter. A quick scan of the H&F paper didn't help me - I tried tracking this change in the docs but didn't really manage to find why this comment was added (it was added in 1.22.0). If you know what this means, is there a way we can explain this comment? Otherwise it can be confusing for readers.
numpy/lib/_function_base_impl.py
Outdated
|
||
* if g > 0 ; then take j | ||
* if g = 0 ; then average between bounds | ||
By default (``method='linear'``), this is done by interpolating between |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like:
By default (``method='linear'``), this is done by interpolating between | |
The optional method parameter specifies the method to use when the desired quantile is non-integer. By default (``method='linear'``), this is done by interpolating between |
I use "non-integer" as a way to express that the result is somehow ambiguous and an arbitrary decision must be made. Feel free to reword it, but the gist of this suggestion is mentioning that this decision/ambiguity is the reason for the method
parameter.
midpoint: | ||
NumPy method kept for backwards compatibility. | ||
Uses ``(i + j) / 2``. | ||
=============================== =============== =============== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, the compromise would be to mention this somehow, as a note to the reader that this table is not complete. Maybe something like move one sentence from a couple paragraphs down and write:
For the continuous estimators defined in [1]_, the table below shows how ``m`` is computed as a function of probability `q`:
Overall, much much better! Thank you so much @mdhaber and @lorentzenchr for the review! |
midpoint: | ||
NumPy method kept for backwards compatibility. | ||
Uses ``(i + j) / 2``. | ||
=============================== =============== =============== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are only three sentences that separate this table of continuous estimators from the discontinuous estimators below.
The first is a minor footnote about clipping the indices. If we cannot make this an actual footnote, the best place for it is immediately after the table. (Before the table would interrupt the explanation, and a Note admonition would only increase the separation between the table and list.)
The next two explain that the table includes only the continuous estimators and introduces the discontinuous estimators from that same paper. I've modified it slightly (below) to emphasize that the table does not include all the estimators from that paper.
numpy/lib/_function_base_impl.py
Outdated
The methods above are all continuous functions of probability `q`. | ||
Methods 1-3 of H&F [1]_ provide three discontinuous estimators, where | ||
``j`` is defined as above and ``m`` and ``g`` are defined as follows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The methods above are all continuous functions of probability `q`. | |
Methods 1-3 of H&F [1]_ provide three discontinuous estimators, where | |
``j`` is defined as above and ``m`` and ``g`` are defined as follows. | |
The table above includes only the estimators from H&F that are continuous | |
functions of probability `q` (estimators 4-9). NumPy also provides the | |
three discontinuous estimators from H&F (estimators 1-3), where ``j`` is | |
defined as above and ``m`` and ``g`` are defined as follows. |
numpy/lib/_function_base_impl.py
Outdated
where ``m`` is defined differently for each value of the ``method`` | ||
parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where ``m`` is defined differently for each value of the ``method`` | |
parameter. | |
where ``m`` may be defined according to several different conventions. | |
The preferred convention may be selected using the ``method`` parameter: |
numpy/lib/_function_base_impl.py
Outdated
|
||
* if g > 0 ; then take j | ||
* if g = 0 ; then average between bounds | ||
By default (``method='linear'``), this is done by interpolating between |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To address this, I've removed the mention of the method
parameter until later. Now we start with the simple default, then (only for readers who are still interested) introduce the method
parameter.
numpy/lib/_function_base_impl.py
Outdated
Given a sample `a` from an underlying distribution, `quantile` provides a | ||
nonparametric estimate of the inverse cumulative distribution function. | ||
|
||
More formally, the quantile at probability level :math:`q` of a cumulative | ||
distribution function :math:`F(y)=P(Y \\leq y)` with probability measure | ||
:math:`P` is defined as any number :math:`x` that fulfills the | ||
*coverage conditions* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've kept them, and I've made your argument contiguous again.
Thanks @melissawm. I think the "R types" refer to the "Types" listed in the documentation of R's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me, thank you @mdhaber !
I'll leave it open in case @lorentzenchr wants to take a last look. Cheers! |
Since this has been close to its present state for a few months, I'd ask whether we can merge as-is, and if there are additional changes to be made, they can be made in a new PR by the one who wants to make them? |
I'll have a quick look... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments. Would be good without them.
I really like this doc improvement. It reads so much better!
I have no rights to approve, but consider it a LGTM.
distribution function :math:`F(y)=P(Y \\leq y)` with probability measure | ||
:math:`P` is defined as any number :math:`x` that fulfills the | ||
*coverage conditions* | ||
Given a sample `a` from an underlying distribution, `quantile` provides a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes it's `a`
sometimes its ``a``
and it renders differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to numpydoc standards , single backticks are correct here because a
is an argument of this function. There were only single-backticks in the original docstring, and I've used single backticks consistently here. However, I see that some of each were added in gh-24254, and there's one double-backtick left that I didn't notice, since I wrote this before that PR merged. I added #25704 (comment) to fix this.
For more information, see numpy/numpydoc#52.
|
||
.. math:: | ||
i + g = q * ( n - alpha - beta + 1 ) + alpha | ||
By default, this is done by interpolating between adjacent elements in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, this is done by interpolating between adjacent elements in | |
By default (`method="linear"`), this is done by interpolating between adjacent elements in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had this, but removed it to respond to another comment.
method 2 of H&F [1]_. | ||
This method gives discontinuous results: | ||
- ``j = (q*n + m - 1) // 1``, and | ||
- ``g = (q*n + m - 1) % 1``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add here, that the -1
is for translating from 1-based (H&F) to 0-based (Python) indexing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
* alpha = 0 | ||
* beta = 1 | ||
1. ``inverted_cdf``: ``m = 0`` and ``g = int(q*n > 0)`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my experience, this is the version used the most in statistical literature and corresponds to H&F's first equation in the introduction. But this equation is not general enough, it neglects distributions with flat regions, e.g. step functions.
Summary: better no action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. Does "better no action" mean that there is no action to take to respond to this comment, or that I should not have taken action to adjust this part of the documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I suggest to leave it as is. Just wanted to share thoughts with a fellow statistian.
|
||
* if g > 0 ; then take j | ||
* if g = 0 ; then average between bounds | ||
where ``m`` may be defined according to several different conventions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To respond to this,
where ``m`` may be defined according to several different conventions. | |
where the ``- 1`` in these formulas accounts for Python's 0-based indexing, | |
and ``m`` may be defined according to several different conventions. |
I suppose I left this out original because I chose not to explain all of the ways in which the formulas are not written as they are in the paper. For instance, I also use %
to take the fractional part rather than subtracting the integral part. This seems equally notable to me as an adaptation from math to code. While adding such information would be somewhat helpful those who want to compare this to the original text, I think they make it harder to follow for those who are reading only this document. I would prefer not to add this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//
and %
are clear and don’t need explanation.
The different base of indexing is in my opinion worth mentioning somewhere. It took me a while to figure it out.
Maybe a good place is below the reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seberg Why did you close? |
@seberg Why did you close? Ouch, I honstly don't know how that happened! |
I see the Windows CI failure in another PR (gh-26386); seems unrelated. |
Had a quick read through and it looks good and it I think things were settling. Thanks everyone, let's get this in! |
gh-24588 noted inaccuracies in the
quantile
documentation. This PR corrects these and simplifies the descriptions of the various methods. It also refers users ofpercentile
,nanquantile
,nanpercentile
to the documentation ofquantile
to reduce redunandancy and ensure that the notes of these functions do not get out of sync.Closes gh-24588
@lorentzenchr