Skip to content

DOC: Improve np.mean documentation of the out argument #25431

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 3, 2024

Conversation

pieleric
Copy link
Contributor

@pieleric pieleric commented Dec 19, 2023

I've spent a couple of hours pulling my hair out on a computation that should work but didn't. Eventually, it turned out that was due to a limitation of the np.mean() function. I thought I'd document it, for others avoid spending too much time on it.

Essentially, the limitation looks like that:

>>> outi8 = numpy.empty((1,), numpy.uint8)
>>> numpy.mean([[173, 173]], axis=1, out=outi8, dtype=numpy.uint16)
array([45], dtype=uint8)

=> Document the effect of using the out argument, which can cause under/overflow issues with integers dtype.

@ngoldbaum
Copy link
Member

This is a generic problem with how numpy integer arrays behave with overflow, see #8987. If that were fixed, your code would generate a warning or error on overflow.

Because this is a general problem with numpy integer arrays, I'm not sure it's very useful to only document it once in this function. It would be better to have a place in the dtype documentation that goes over this problem and then we can link to it from elsewhere.

@rossbar
Copy link
Contributor

rossbar commented Dec 19, 2023

I'm not sure it's very useful to only document it once in this function. It would be better to have a place in the dtype documentation that goes over this problem and then we can link to it from elsewhere.

Agreed - I'm -1 on adding a blurb like this individually to every out parameter description or any other function that has an accumulator.

@rossbar rossbar added the 57 - Close? Issues which may be closable unless discussion continued label Dec 19, 2023
@pieleric
Copy link
Contributor Author

Sorry, I made a mistake in the example code. I had a int8 (which never fits 173) instead of uint8 (which does). I've updated the description.

I think the issue is not per-se related to the overflows being "silent" in numpy. I agree with you that if there had been a warning, I would have spent a lot less time finding out the issue. But iIt's specific to the way the mean() function is currently implemented, and to some extend the way the dtype parameter is used in this function.

Maybe, to clarify, here is a code that works:

>>> outi8 = numpy.empty((1,), numpy.uint8)
>>> outi8[:] = numpy.mean([[173, 173]], axis=1, dtype=numpy.uint16)
>>> outi8
array([173], dtype=uint8)

From the current documentation, I assumed this code and the one in the description were equivalent, but would avoid creating an intermediary array of uint16. I assumed that for each final value, mean() would sum all the corresponding elements (using dtype), and divide it by the number of elements, to immediately store the result in out. Such implementation would probably be faster (due to the result always being in cache) and avoid the current limitation. I didn't expect it would first do the sum on the whole array, and only later divide the intermediary array by the number of elements.

Moreover, typically, in most numpy functions, the dtype argument defines the dtype of the output of the function. For mean(), the documentation clearly states that dtype is used to define the intermediary accumulator type. So when reading the documentation it is easy to believe that the intermediary accumulator type is independent from the dtype of out.

Honestly, my first idea was also to document better this behaviour inside the ufunc documentation. However, I think their behaviour is totally normal and expected (assuming you're aware of under/overflows). The fact that umath.add.reduce(...,out) stores the result using the dtype of out is nothing surprising. What is odd that mean(dtype, out) uses the dtype of out as intermediary dtype.

@ngoldbaum
Copy link
Member

Moreover, typically, in most numpy functions, the dtype argument defines the dtype of the output of the function. For mean(), the documentation clearly states that dtype is used to define the intermediary accumulator type.

Ufunc reductions also behave like this (e.g. arr.sum(), which is shorthand for np.add.reduce(arr). Mean is actually calling np.add.reduce in its implementation.

So like we said above, this is a generic feature of ufunc reductions. If you specify an out array, that array will be used for the accumulation.

@pieleric
Copy link
Contributor Author

Ufunc reductions also behave like this (e.g. arr.sum(), which is shorthand for np.add.reduce(arr). Mean is actually calling np.add.reduce in its implementation.

So like we said above, this is a generic feature of ufunc reductions. If you specify an out array, that array will be used for the accumulation.

Yes. I do agree on this. I think it's logical that the result is stored in the out argument. The documentation of the ufunc explains it sufficiently IMHO. However, when I was using mean(), I didn't have in mind that it would use sum() and place the intermediary value in out. With mean, mathematically, you can expect that the output is always within the same range as the input. So I was convinced that using a out with the same dtype as the input would always work. This is not the case. That is this behaviour that I'd like to document.

How (and where) would you suggest to clarify it? I can adjust this PR accordingly.

@ngoldbaum
Copy link
Member

I personally don't see mean as being special here. Conceptually, it's a reduction ufunc, just like sum is. out is also not behaving any differently in mean compared with any other reduction in numpy.

@charris
Copy link
Member

charris commented Dec 21, 2023

Nothing wrong with a bit of duplicate documentation, many may be unfamiliar with the larger idea of "reduction".

@mdhaber
Copy link
Contributor

mdhaber commented Dec 28, 2023

I've considered this a bit more carefully, and I think a documentation improvement of some kind would be helpful.

I agree that one would expect the overflow of:

numpy.add.reduce([[173, 173]], axis=1, out=outi8, dtype=numpy.uint16)

The documentation of Ufunc methods does state that dtype is ignore if out is provided. I'm not sure if it's explicit, but I can infer that out will control the precision of intermediate calculations since it causes dtype to be ignored.

However, the user does not necessarily know that mean is implemented like this. I don't think the mean documentation makes this connections explicit, although there are a few hints. It does state that:

Note that for floating-point input, the mean is computed using the same precision the input has. Depending on the input data, this can cause the results to be inaccurate, especially for float32 (see example below). Specifying a higher-precision accumulator using the dtype keyword can alleviate this issue.

But it doesn't mention that dtype is ignored if out is specified. So I think there is some room for documentation improvement here.

I considered an analogous example, although one less likely to be encountered in the wild:

numpy.std([[0, 1e20]], axis=1, dtype=np.float64)
# array([5.e+19])
numpy.std([[0, 1e20]], axis=1, dtype=np.float64).astype(np.float32)
# array([5.e+19], dtype=float32)

out = np.zeros((1,), dtype=np.float32)
numpy.std([[0, 1e20]], axis=1, dtype=np.float64, out=out)
# array([inf], dtype=float32)

So there are at least a few other functions where this documentation could be added (not to mention nanmean).

There seems to be some support for linking to generic documentation. What do you think of explicitly stating in the documentation of reducing statistics that they behave like "reduce-like methods" to use the phrasing of:
image

and link to https://numpy.org/doc/stable/user/basics.ufuncs.html#ufunc-methods?

I think it is also worth mentioning in the dtype descriptions that dtype is ignored if out is provided.

Would that address the issue @pieleric?

@pieleric
Copy link
Contributor Author

pieleric commented Jan 4, 2024

Would that address the issue @pieleric?

Yes @mdhaber, that would most likely have saved me time, as I did read the ufuncs documentation, hopping I'd find something that explain this behaviour. So it should help others too.

BTW, to be precise, when out is passed, dtype is not ignored, it still affects the result if it's smaller than the dtype of out. For instance:

>>> outi16 = numpy.empty((1,), numpy.uint16)
>>> numpy.mean([[173, 173]], axis=1, out=outi16, dtype=numpy.uint8)
array([45], dtype=uint16)

I will adjust this PR with a suggestion for the ufuncs documentation.

Document the effect of using the out argument, which can cause under/overflow
issues with integers dtype.

Also correct the description of how out with dtype is handled in ufuncs in general.
@pieleric
Copy link
Contributor Author

pieleric commented Jan 8, 2024

@mdhaber , I've updated this PR according to your suggestions. Honestly, I struggled a bit to find a good way to describe the behaviour. Let me know if you can think of a better English description of the limitation 😄 I didn't update all the "reducing statistics" function yet, as I'd rather first hear you feedback before copy-pasting to all the functions' docstring.

I've also taken the opportunity to correct the description of out and dtype together, as dtype is not ignored.

@mdhaber
Copy link
Contributor

mdhaber commented Feb 3, 2024

Will merge when CI finishes.

@mdhaber mdhaber merged commit 5c34c87 into numpy:main Feb 3, 2024
@@ -3505,6 +3505,7 @@ def mean(a, axis=None, dtype=None, out=None, keepdims=np._NoValue, *,
is ``None``; if provided, it must have the same shape as the
expected output, but the type will be cast if necessary.
See :ref:`ufuncs-output-type` for more details.
See :ref:`ufuncs-output-type` for more details.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdhaber , this line is now duplicated.

@pieleric
Copy link
Contributor Author

pieleric commented Feb 5, 2024

@mdhaber , thanks for looking into this PR. Sorry for not reacting before. I was sick, and had to focus on a lot of other things with the little time I had.
I've made a PR concerning the way "dtype" and "out" arguments interact with each other. See PR #25768 .
I'll try to make again a PR concerning the original topic of "out" being used as an extra "dtype" for some of the "reduce-like methods". I'm not sure how this would fit the ufuncs documentation (as these methods are not ufuncs), but I'll try...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
04 - Documentation 57 - Close? Issues which may be closable unless discussion continued
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants