BUG: np.mean is slower than np.sum + a division #21455


Closed
eendebakpt opened this issue May 5, 2022 · 9 comments

@eendebakpt
Contributor

Describe the issue:

Using np.mean takes longer than np.sum and then manually dividing by the number of elements.

Benchmark result of np.mean(x, axis=1) vs. np.sum(x, axis=1)/x.shape[1]:

np.mean  0.8456945639991318
np.sum + division 0.5549134930006403

(benchmark itself in the reproduce code block below)

Analysis

Using valgrind I did some profiling on np.mean.

[valgrind call-graph screenshot: np_mean_axis_1]

Below ufunc_generic_fastcall there are 4 large blocks, which I analyze below.

  • There is an expensive call to solve_may_share_memory (via ufunc_generic_fastcall -> PyUFunc_GenericFunctionInternal -> try_trivial_single_output_loop -> PyArray_EQUIVALENTLY_ITERABLE_OVERLAP_OK -> solve_may_share_memory). This is for a ufunc called divide (the second step in the mean calculation).
    But the call to ufunc_generic_fastcall was made without keyword arguments and with out_is_passed_by_position=0, so a new output array is allocated for the result, and it cannot overlap with the input arrays. If that is really the case, can we pass this knowledge to try_trivial_single_output_loop somehow to skip the PyArray_EQUIVALENTLY_ITERABLE_OVERLAP_OK check? (See the sketch after this list.)

  • PyUFunc_CheckOverride is expensive because the __array_ufunc__ attribute is missing. There are multiple ways of improving this, but there is already an issue at cpython (Performance of attribute lookup for type objects, python/cpython#92216) that might help here.

  • Resolving PyUFunc_TrueDivisionTypeResolver takes some time. The method is from ufunc_type_resolution.c, which is marked as legacy code. Can we avoid getting to this point? It is called from promote_and_get_info_and_ufuncimpl (dispatching.c), where a comment notes that "using promotion failed, this should normally be an error". What is happening there? All types involved are float64, I guess, so it should not be hard to find the right ufunc implementation.

  • Why the PyArray_CastToType? It is called from check_for_trivial_loop where must_copy is True. The array is small, so a copy is made.
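
To make the first bullet concrete, here is a minimal Python sketch (illustrative only, not the internals) of the two divide call patterns discussed: with out= aliased to an input, which is the pattern that reaches the overlap check, and without out=, where the freshly allocated output cannot overlap the inputs.

import numpy as np

x = np.random.rand(5, 8)
v = np.add.reduce(x, axis=1)  # the sum step of mean
n = x.shape[1]

# In-place divide: out aliases an input, so the ufunc machinery runs
# PyArray_EQUIVALENTLY_ITERABLE_OVERLAP_OK / solve_may_share_memory.
np.true_divide(v, n, out=v)

# Divide without out=: a new output array is allocated, which by
# construction cannot overlap the inputs.
v2 = np.true_divide(np.add.reduce(x, axis=1), n)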

@seberg

Reproduce the code example:

import numpy as np
import time

x = np.random.rand(5, 8)
niter = 100_000

t0 = time.perf_counter()
for kk in range(niter):
    v1 = np.mean(x, axis=1)
dt1 = time.perf_counter() - t0
print(f'np.mean  {dt1}')

t0 = time.perf_counter()
for kk in range(niter):
    v = np.sum(x, axis=1)
    v2 = v / x.shape[1]
dt2 = time.perf_counter() - t0
print(f'np.sum + division {dt2}')

np.testing.assert_array_equal(v1, v2)

Error message:

No response

NumPy/Python version information:

import sys, numpy; print(numpy.__version__, sys.version)
1.23.0.dev0+1127.g7f77205be 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0]

@seberg
Member

seberg commented May 5, 2022

About the last two points: the problem is that we are dividing by an integer scalar. So a cast is necessary, and the promotion isn't quite trivial, unfortunately (also, because it is a scalar, the legacy code kicks in, which inspects the value :/; I am trying to push that out, but it's slow progress).
So we cannot easily avoid the "Resolver" because of the scalar integer here (or I couldn't think of a way).

Now, I admit, I think some of those paths that need to fall back got slower when scalars were involved; maybe that is part of what you see now. (I remember there was one path that got slower while most others got faster, but I forgot which it was.)

The overlap check is behind an if (op[nin] == NULL) { check, so it shouldn't get called if out is not passed explicitly? But np.mean does pass out explicitly. In fact, out is identical to the first input there (which is OK; the overlap detection should say that they are the same).
Now, what might be possible is to do an identity check for these situations in PyArray_EQUIVALENTLY_ITERABLE_OVERLAP_OK (or similar), since this is a common situation for in-place operators!

A fun way to reduce the __array_function__/__array_ufunc__ overhead might be to do rcount = int(rcount) to make it a Python integer.

It is good to look at this more closely, but mean is a pure Python function (you probably already found it in numpy/core/_methods.py), and to really knock away the last bit of overhead, we might have to reconsider that?!
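
For reference, a simplified sketch of the structure being described (this is not the actual numpy/core/_methods.py source; names and details are condensed): mean is sum-then-divide in pure Python, and the element count is where a NumPy integer scalar can sneak into the division.

import numpy as np

def _count_items_sketch(arr, axis):
    items = np.intp(1)          # NumPy scalar: every *= keeps it an np.intp
    for ax in axis:
        items *= arr.shape[ax]
    return items

def _mean_sketch(arr, axis):
    rcount = _count_items_sketch(arr, (axis,))
    rcount = int(rcount)        # the suggested fix: a plain Python int
    ret = np.add.reduce(arr, axis=axis, dtype=np.float64)
    return np.true_divide(ret, rcount, out=ret)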

@seberg
Member

seberg commented May 5, 2022

Interesting: when where= was introduced, we may have switched from getting 0 to getting np.intp(0) out of the counting function. That might well be the big slow-down here since, as I said, returning int might hit a few faster paths (i.e. avoid the __array_ufunc__ stuff).
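
A quick, machine-dependent way to check the effect described here (timings are illustrative only):

import timeit
import numpy as np

v = np.random.rand(5)
n_np = np.intp(8)
n_py = 8

print(timeit.timeit(lambda: v / n_np, number=100_000))  # NumPy scalar divisor
print(timeit.timeit(lambda: v / n_py, number=100_000))  # plain Python int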

@seberg
Member

seberg commented May 5, 2022

Just to confirm that:

diff --git a/numpy/core/_methods.py b/numpy/core/_methods.py
index a239e2c87..b6ec15714 100644
--- a/numpy/core/_methods.py
+++ b/numpy/core/_methods.py
@@ -71,7 +71,7 @@ def _count_reduce_items(arr, axis, keepdims=False, where=True):
             axis = tuple(range(arr.ndim))
         elif not isinstance(axis, tuple):
             axis = (axis,)
-        items = nt.intp(1)
+        items = 1
         for ax in axis:
             items *= arr.shape[mu.normalize_axis_index(ax, arr.ndim)]
     else:

speeds things up quite a lot in your example. I do not know if scalar math is involved here (or how much), so my scalar-math update may help. But I suspect this also helps for the overrides etc.

seberg pushed a commit that referenced this issue May 11, 2022
The method `PyUFuncOverride_GetNonDefaultArrayUfunc` is expensive on numpy scalars because these objects do not have a `__array_ufunc__` set and for a missing attribute lookup cpython generates an exception that is later cleared by numpy. This is a performance bottleneck, see #21455.

An issue has been submitted to cpython (python/cpython#92216). But even if this is addressed in cpython, it will take until Python 3.12+ before it is usable by numpy.

As an alternative solution, this PR adds a fast path to `PyUFuncOverride_GetNonDefaultArrayUfunc` to determine whether an object is a numpy scalar.
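
A Python-level view of the cost that commit removes (a rough probe; the hasattr result reflects NumPy at the time of this issue):

import timeit
import numpy as np

s = np.float64(1.0)
print(hasattr(s, '__array_ufunc__'))  # False here: the lookup raises and
                                      # clears an AttributeError internally

# Probing the missing attribute pays for that exception machinery;
# an exact-type check, as in the fast path, does not.
print(timeit.timeit(lambda: getattr(s, '__array_ufunc__', None), number=100_000))
print(timeit.timeit(lambda: type(s) is np.float64, number=100_000))
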
@eendebakpt
Contributor Author

With the PRs linked to this issue merged, the difference is smaller. np.mean is a Python function, so some overhead is to be expected.

There are two more items that can be addressed to reduce the difference:

Closing this issue

seberg pushed a commit that referenced this issue May 14, 2022
…E_OVERLAP_OK (#21464)

Addresses one of the items in #21455

* fast check on equivalent arrays in PyArray_EQUIVALENTLY_ITERABLE_OVERLAP_OK

* fix for case stride1==0

* remove assert statements

* whitespace

* DOC: Add comment that output broadcasting is rejected by overlap detection

That is a bit of a weird dynamic here, and it would probably be nice to clean up. However, it is also not a reason to not add the fast path right now.

---

Note that it may also be good to just use a data-pointer comparison or push this fast-path into the overlap detection function itself.
@seberg
Member

seberg commented May 24, 2022

Now that branching happened, I can put you on the trace that our generic objects should probably not have a custom tp_alloc at all ;).
There is a tiny chance that this affects downstream subclasses, but I doubt it, since user dtypes shouldn't really be able to require NEEDS_INIT for anything reasonable.

@eendebakpt
Contributor Author

> Now that branching happened, I can put you on the trace that our generic objects should probably not have a custom tp_alloc at all ;). There is a tiny chance that this affects downstream subclasses, but I doubt it, since user dtypes shouldn't really be able to require NEEDS_INIT for anything reasonable.

@seberg Not sure I understand this.

@seberg
Member

seberg commented May 24, 2022

The np.generic scalar provides a tp_alloc which is inherited by all scalars. This does some unnecessary initialization (only void really needs to initialize).
Removing that tp_alloc() will mean that np.float64, etc. use the default Python allocation directly, which probably removes a tiny bit of overhead.
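
The tp_alloc cost is only visible from Python as part of total scalar-creation time; a rough, machine-dependent probe (not proof that allocation dominates):

import timeit
import numpy as np

print(timeit.timeit(lambda: np.float64(0.0), number=1_000_000))
print(timeit.timeit(lambda: float(0.0), number=1_000_000))  # CPython baseline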

@eendebakpt
Contributor Author

Do you mean we can modify PyGenericArrType_Type.tp_alloc = gentype_alloc here https://github.com/numpy/numpy/blob/main/numpy/core/src/multiarray/scalartypes.c.src#L3874 and set it to _PyObject_New?

The differences between gentype_alloc and _PyObject_New are the allocation of one extra item, a memset(obj, 0, size), and a special case for type->tp_itemsize == 0. Is there any information on why those differences are there?

@seberg

@seberg
Member

seberg commented May 25, 2022

Hmmm, the generic version is PyType_GenericAlloc I think (i.e. what is used if you just delete that line of code).
Seems that does exactly the same stuff, except that it checks for whether it needs storage for the GC.

I thought I had timed it to make things faster if I remove the custom version. Maybe a fluke, or maybe reusing that version is actually good (because it is used all over the place).

EDIT: I suppose in principle we could keep it, but just cut it down a lot (no need for the memset or the itemsize check for our scalars, really).
