DOC: Improve `np.mean` documentation of the out argument #25431

pieleric · 2023-12-19T13:05:15Z

I've spent a couple of hours pulling my hair out on a computation that should work but didn't. Eventually, it turned out that was due to a limitation of the np.mean() function. I thought I'd document it, for others avoid spending too much time on it.

Essentially, the limitation looks like that:

>>> outi8 = numpy.empty((1,), numpy.uint8)
>>> numpy.mean([[173, 173]], axis=1, out=outi8, dtype=numpy.uint16)
array([45], dtype=uint8)

=> Document the effect of using the out argument, which can cause under/overflow issues with integers dtype.

ngoldbaum · 2023-12-19T15:31:09Z

This is a generic problem with how numpy integer arrays behave with overflow, see #8987. If that were fixed, your code would generate a warning or error on overflow.

Because this is a general problem with numpy integer arrays, I'm not sure it's very useful to only document it once in this function. It would be better to have a place in the dtype documentation that goes over this problem and then we can link to it from elsewhere.

rossbar · 2023-12-19T18:42:36Z

I'm not sure it's very useful to only document it once in this function. It would be better to have a place in the dtype documentation that goes over this problem and then we can link to it from elsewhere.

Agreed - I'm -1 on adding a blurb like this individually to every out parameter description or any other function that has an accumulator.

pieleric · 2023-12-19T23:56:23Z

Sorry, I made a mistake in the example code. I had a int8 (which never fits 173) instead of uint8 (which does). I've updated the description.

I think the issue is not per-se related to the overflows being "silent" in numpy. I agree with you that if there had been a warning, I would have spent a lot less time finding out the issue. But iIt's specific to the way the mean() function is currently implemented, and to some extend the way the dtype parameter is used in this function.

Maybe, to clarify, here is a code that works:

>>> outi8 = numpy.empty((1,), numpy.uint8)
>>> outi8[:] = numpy.mean([[173, 173]], axis=1, dtype=numpy.uint16)
>>> outi8
array([173], dtype=uint8)

From the current documentation, I assumed this code and the one in the description were equivalent, but would avoid creating an intermediary array of uint16. I assumed that for each final value, mean() would sum all the corresponding elements (using dtype), and divide it by the number of elements, to immediately store the result in out. Such implementation would probably be faster (due to the result always being in cache) and avoid the current limitation. I didn't expect it would first do the sum on the whole array, and only later divide the intermediary array by the number of elements.

Moreover, typically, in most numpy functions, the dtype argument defines the dtype of the output of the function. For mean(), the documentation clearly states that dtype is used to define the intermediary accumulator type. So when reading the documentation it is easy to believe that the intermediary accumulator type is independent from the dtype of out.

Honestly, my first idea was also to document better this behaviour inside the ufunc documentation. However, I think their behaviour is totally normal and expected (assuming you're aware of under/overflows). The fact that umath.add.reduce(...,out) stores the result using the dtype of out is nothing surprising. What is odd that mean(dtype, out) uses the dtype of out as intermediary dtype.

ngoldbaum · 2023-12-21T16:57:33Z

Moreover, typically, in most numpy functions, the dtype argument defines the dtype of the output of the function. For mean(), the documentation clearly states that dtype is used to define the intermediary accumulator type.

Ufunc reductions also behave like this (e.g. arr.sum(), which is shorthand for np.add.reduce(arr). Mean is actually calling np.add.reduce in its implementation.

So like we said above, this is a generic feature of ufunc reductions. If you specify an out array, that array will be used for the accumulation.

pieleric · 2023-12-21T17:19:07Z

Ufunc reductions also behave like this (e.g. arr.sum(), which is shorthand for np.add.reduce(arr). Mean is actually calling np.add.reduce in its implementation.

So like we said above, this is a generic feature of ufunc reductions. If you specify an out array, that array will be used for the accumulation.

Yes. I do agree on this. I think it's logical that the result is stored in the out argument. The documentation of the ufunc explains it sufficiently IMHO. However, when I was using mean(), I didn't have in mind that it would use sum() and place the intermediary value in out. With mean, mathematically, you can expect that the output is always within the same range as the input. So I was convinced that using a out with the same dtype as the input would always work. This is not the case. That is this behaviour that I'd like to document.

How (and where) would you suggest to clarify it? I can adjust this PR accordingly.

ngoldbaum · 2023-12-21T17:23:13Z

I personally don't see mean as being special here. Conceptually, it's a reduction ufunc, just like sum is. out is also not behaving any differently in mean compared with any other reduction in numpy.

charris · 2023-12-21T17:27:05Z

Nothing wrong with a bit of duplicate documentation, many may be unfamiliar with the larger idea of "reduction".

mdhaber · 2023-12-28T16:42:56Z

I've considered this a bit more carefully, and I think a documentation improvement of some kind would be helpful.

I agree that one would expect the overflow of:

numpy.add.reduce([[173, 173]], axis=1, out=outi8, dtype=numpy.uint16)

The documentation of Ufunc methods does state that dtype is ignore if out is provided. I'm not sure if it's explicit, but I can infer that out will control the precision of intermediate calculations since it causes dtype to be ignored.

However, the user does not necessarily know that mean is implemented like this. I don't think the mean documentation makes this connections explicit, although there are a few hints. It does state that:

Note that for floating-point input, the mean is computed using the same precision the input has. Depending on the input data, this can cause the results to be inaccurate, especially for float32 (see example below). Specifying a higher-precision accumulator using the dtype keyword can alleviate this issue.

But it doesn't mention that dtype is ignored if out is specified. So I think there is some room for documentation improvement here.

I considered an analogous example, although one less likely to be encountered in the wild:

numpy.std([[0, 1e20]], axis=1, dtype=np.float64)
# array([5.e+19])
numpy.std([[0, 1e20]], axis=1, dtype=np.float64).astype(np.float32)
# array([5.e+19], dtype=float32)

out = np.zeros((1,), dtype=np.float32)
numpy.std([[0, 1e20]], axis=1, dtype=np.float64, out=out)
# array([inf], dtype=float32)

So there are at least a few other functions where this documentation could be added (not to mention nanmean).

There seems to be some support for linking to generic documentation. What do you think of explicitly stating in the documentation of reducing statistics that they behave like "reduce-like methods" to use the phrasing of:

and link to https://numpy.org/doc/stable/user/basics.ufuncs.html#ufunc-methods?

I think it is also worth mentioning in the dtype descriptions that dtype is ignored if out is provided.

Would that address the issue @pieleric?

pieleric · 2024-01-04T08:29:07Z

Would that address the issue @pieleric?

Yes @mdhaber, that would most likely have saved me time, as I did read the ufuncs documentation, hopping I'd find something that explain this behaviour. So it should help others too.

BTW, to be precise, when out is passed, dtype is not ignored, it still affects the result if it's smaller than the dtype of out. For instance:

>>> outi16 = numpy.empty((1,), numpy.uint16)
>>> numpy.mean([[173, 173]], axis=1, out=outi16, dtype=numpy.uint8)
array([45], dtype=uint16)

I will adjust this PR with a suggestion for the ufuncs documentation.

Document the effect of using the out argument, which can cause under/overflow issues with integers dtype. Also correct the description of how out with dtype is handled in ufuncs in general.

pieleric · 2024-01-08T23:38:22Z

@mdhaber , I've updated this PR according to your suggestions. Honestly, I struggled a bit to find a good way to describe the behaviour. Let me know if you can think of a better English description of the limitation 😄 I didn't update all the "reducing statistics" function yet, as I'd rather first hear you feedback before copy-pasting to all the functions' docstring.

I've also taken the opportunity to correct the description of out and dtype together, as dtype is not ignored.

doc/source/user/basics.ufuncs.rst

numpy/_core/fromnumeric.py

numpy/lib/_nanfunctions_impl.py

doc/source/user/basics.ufuncs.rst

[skip ci]

mdhaber · 2024-02-03T04:25:40Z

Will merge when CI finishes.

pieleric · 2024-02-05T08:31:30Z

numpy/_core/fromnumeric.py

@@ -3505,6 +3505,7 @@ def mean(a, axis=None, dtype=None, out=None, keepdims=np._NoValue, *,
        is ``None``; if provided, it must have the same shape as the
        expected output, but the type will be cast if necessary.
        See :ref:`ufuncs-output-type` for more details.
+        See :ref:`ufuncs-output-type` for more details.


@mdhaber , this line is now duplicated.

pieleric · 2024-02-05T17:18:24Z

@mdhaber , thanks for looking into this PR. Sorry for not reacting before. I was sick, and had to focus on a lot of other things with the little time I had.
I've made a PR concerning the way "dtype" and "out" arguments interact with each other. See PR #25768 .
I'll try to make again a PR concerning the original topic of "out" being used as an extra "dtype" for some of the "reduce-like methods". I'm not sure how this would fit the ufuncs documentation (as these methods are not ufuncs), but I'll try...

github-actions bot added the 04 - Documentation label Dec 19, 2023

rossbar added the 57 - Close? Issues which may be closable unless discussion continued label Dec 19, 2023

DOC: Improve np.mean documentation of the out argument

30d3294

Document the effect of using the out argument, which can cause under/overflow issues with integers dtype. Also correct the description of how out with dtype is handled in ufuncs in general.

pieleric force-pushed the main branch from 037cb28 to 30d3294 Compare January 8, 2024 23:30

mdhaber requested changes Jan 15, 2024

View reviewed changes

doc/source/user/basics.ufuncs.rst Outdated Show resolved Hide resolved

numpy/_core/fromnumeric.py Outdated Show resolved Hide resolved

numpy/_core/fromnumeric.py Outdated Show resolved Hide resolved

numpy/lib/_nanfunctions_impl.py Outdated Show resolved Hide resolved

mdhaber reviewed Feb 3, 2024

View reviewed changes

doc/source/user/basics.ufuncs.rst Outdated Show resolved Hide resolved

mdhaber added 2 commits February 2, 2024 20:24

Apply suggestions from code review

f089f33

[skip ci]

Merge branch 'main' into main

ada2f31

mdhaber approved these changes Feb 3, 2024

View reviewed changes

mdhaber merged commit 5c34c87 into numpy:main Feb 3, 2024

pieleric commented Feb 5, 2024

View reviewed changes

Uh oh!

DOC: Improve np.mean documentation of the out argument #25431

DOC: Improve np.mean documentation of the out argument #25431

Uh oh!

Conversation

pieleric commented Dec 19, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngoldbaum commented Dec 19, 2023

Uh oh!

rossbar commented Dec 19, 2023

Uh oh!

pieleric commented Dec 19, 2023

Uh oh!

ngoldbaum commented Dec 21, 2023

Uh oh!

pieleric commented Dec 21, 2023

Uh oh!

ngoldbaum commented Dec 21, 2023

Uh oh!

charris commented Dec 21, 2023

Uh oh!

mdhaber commented Dec 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pieleric commented Jan 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pieleric commented Jan 8, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mdhaber commented Feb 3, 2024

Uh oh!

pieleric Feb 5, 2024

Choose a reason for hiding this comment

Uh oh!

pieleric commented Feb 5, 2024

Uh oh!

Uh oh!

DOC: Improve `np.mean` documentation of the out argument #25431

DOC: Improve `np.mean` documentation of the out argument #25431

pieleric commented Dec 19, 2023 •

edited

Loading

mdhaber commented Dec 28, 2023 •

edited

Loading

pieleric commented Jan 4, 2024 •

edited

Loading