BUG: np.mean is slower than np.sum + a division #21455


Closed
eendebakpt opened this issue May 5, 2022 · 9 comments

@eendebakpt
Contributor

Describe the issue:

Using np.mean takes longer than np.sum and then manually dividing by the number of elements.

Benchmark result of np.mean(x, axis=1) vs. np.sum(x, axis=1)/x.shape[1]:

np.mean  0.8456945639991318
np.sum + division 0.5549134930006403

(benchmark itself in the reproduce code block below)

Analysis

Using valgrind I did some profiling on np.mean.

[valgrind call-graph screenshot: np_mean_axis_1]

Below ufunc_generic_fastcall there are 4 large blocks, which I analyze below.

  • There is an expensive call to solve_may_share_memory (via ufunc_generic_fastcall -> PyUFunc_GenericFunctionInternal -> try_trivial_single_output_loop -> PyArray_EQUIVALENTLY_ITERABLE_OVERLAP_OK -> solve_may_share_memory). This is for a ufunc called divide (the second step in the mean calculation).
    But the call to ufunc_generic_fastcall was made without keyword arguments and with out_is_passed_by_position=0, so a new output array is allocated for the result, and it cannot overlap with the input arrays. If that is really the case, can we pass this knowledge to try_trivial_single_output_loop somehow to skip the PyArray_EQUIVALENTLY_ITERABLE_OVERLAP_OK check? (See the sketch after this list.)

  • PyUFunc_CheckOverride is expensive because the __array_ufunc__ attribute is missing. There are multiple ways of improving this, but there is already an issue at cpython (Performance of attribute lookup for type objects, python/cpython#92216) that might help here.

  • Resolving PyUFunc_TrueDivisionTypeResolver takes some time. The method is from ufunc_type_resolution.c, which is marked as legacy code. Can we avoid getting to this point? It is called from promote_and_get_info_and_ufuncimpl (dispatching.c), where a comment notes that "using promotion failed, this should normally be an error". What is happening there? All types involved are float64, I guess, so it should not be hard to find the right ufunc implementation.

  • Why the PyArray_CastToType? It is called from check_for_trivial_loop where must_copy is True. The array is small, so a copy is made.
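
To make the first bullet concrete, here is a minimal Python sketch (illustrative only, not the internals) of the two divide call patterns discussed: with out= aliased to an input, which is the pattern that reaches the overlap check, and without out=, where the freshly allocated output cannot overlap the inputs.

import numpy as np

x = np.random.rand(5, 8)
v = np.add.reduce(x, axis=1)  # the sum step of mean
n = x.shape[1]

# In-place divide: out aliases an input, so the ufunc machinery runs
# PyArray_EQUIVALENTLY_ITERABLE_OVERLAP_OK / solve_may_share_memory.
np.true_divide(v, n, out=v)

# Divide without out=: a new output array is allocated, which by
# construction cannot overlap the inputs.
v2 = np.true_divide(np.add.reduce(x, axis=1), n)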

@seberg

Reproduce the code example:

import numpy as np
import time

x = np.random.rand(5, 8)
niter = 100_000

t0 = time.perf_counter()
for kk in range(niter):
    v1 = np.mean(x, axis=1)
dt1 = time.perf_counter() - t0
print(f'np.mean  {dt1}')

t0 = time.perf_counter()
for kk in range(niter):
    v = np.sum(x, axis=1)
    v2 = v / x.shape[1]
dt2 = time.perf_counter() - t0
print(f'np.sum + division {dt2}')

np.testing.assert_array_equal(v1, v2)

Error message:

No response

NumPy/Python version information:

import sys, numpy; print(numpy.__version__, sys.version)
1.23.0.dev0+1127.g7f77205be 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0]

@seberg
Member

seberg commented May 5, 2022

About the last two points: the problem is that we are dividing by an integer scalar. So a cast is necessary, and the promotion isn't quite trivial, unfortunately (also, because it is a scalar, the legacy code kicks in, which inspects the value :/; I am trying to push that out, but it's slow progress).
So we cannot easily avoid the "Resolver" because of the scalar integer here (or I couldn't think of a way).

Now, I admit, I think some of those paths that need to fall back got slower when scalars were involved; maybe that is part of what you see now. (I remember there was one path that got slower while most others got faster, but I forgot which it was.)

The overlap check is behind an if (op[nin] == NULL) { check, so it shouldn't get called if out is not passed explicitly? But np.mean does pass out explicitly. In fact, out is identical to the first input there (which is OK; the overlap detection should say that they are the same).
Now, what might be possible is to do an identity check for these situations in PyArray_EQUIVALENTLY_ITERABLE_OVERLAP_OK (or similar), since this is a common situation for in-place operators!

A fun way to reduce the __array_function__/__array_ufunc__ overhead might be to do rcount = int(rcount) to make it a Python integer.

It is good to look at this more closely, but mean is a pure Python function (you probably already found it in numpy/core/_methods.py), and to really knock away the last bit of overhead, we might have to reconsider that?!
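
For reference, a simplified sketch of the structure being described (this is not the actual numpy/core/_methods.py source; names and details are condensed): mean is sum-then-divide in pure Python, and the element count is where a NumPy integer scalar can sneak into the division.

import numpy as np

def _count_items_sketch(arr, axis):
    items = np.intp(1)          # NumPy scalar: every *= keeps it an np.intp
    for ax in axis:
        items *= arr.shape[ax]
    return items

def _mean_sketch(arr, axis):
    rcount = _count_items_sketch(arr, (axis,))
    rcount = int(rcount)        # the suggested fix: a plain Python int
    ret = np.add.reduce(arr, axis=axis, dtype=np.float64)
    return np.true_divide(ret, rcount, out=ret)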

@seberg
Member

seberg commented May 5, 2022

Interesting: when where= was introduced, we may have switched from getting 0 to getting np.intp(0) out of the counting function. That might well be the big slow-down here since, as I said, returning int might hit a few faster paths (i.e. avoid the __array_ufunc__ stuff).
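
A quick, machine-dependent way to check the effect described here (timings are illustrative only):

import timeit
import numpy as np

v = np.random.rand(5)
n_np = np.intp(8)
n_py = 8

print(timeit.timeit(lambda: v / n_np, number=100_000))  # NumPy scalar divisor
print(timeit.timeit(lambda: v / n_py, number=100_000))  # plain Python int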

@seberg
Member

seberg commented May 5, 2022

Just to confirm that:

diff --git a/numpy/core/_methods.py b/numpy/core/_methods.py
index a239e2c87..b6ec15714 100644
--- a/numpy/core/_methods.py
+++ b/numpy/core/_methods.py
@@ -71,7 +71,7 @@ def _count_reduce_items(arr, axis, keepdims=False, where=True):
             axis = tuple(range(arr.ndim))
         elif not isinstance(axis, tuple):
             axis = (axis,)
-        items = nt.intp(1)
+        items = 1
         for ax in axis:
             items *= arr.shape[mu.normalize_axis_index(ax, arr.ndim)]
     else:

speeds things up quite a lot in your example. I do not know if scalar math is involved here (or how much), so my scalar-math update may help. But I suspect this also helps for the overrides etc.

seberg pushed a commit that referenced this issue May 11, 2022
The method `PyUFuncOverride_GetNonDefaultArrayUfunc` is expensive on numpy scalars because these objects do not have a `__array_ufunc__` set and for a missing attribute lookup cpython generates an exception that is later cleared by numpy. This is a performance bottleneck, see #21455.

An issue has been submitted to cpython (python/cpython#92216). But even if this is addressed in cpython, it will take until Python 3.12+ before it is usable by numpy.

As an alternative solution, this PR adds a fast path to `PyUFuncOverride_GetNonDefaultArrayUfunc` to determine whether an object is a numpy scalar.
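
A Python-level view of the cost that commit removes (a rough probe; the hasattr result reflects NumPy at the time of this issue):

import timeit
import numpy as np

s = np.float64(1.0)
print(hasattr(s, '__array_ufunc__'))  # False here: the lookup raises and
                                      # clears an AttributeError internally

# Probing the missing attribute pays for that exception machinery;
# an exact-type check, as in the fast path, does not.
print(timeit.timeit(lambda: getattr(s, '__array_ufunc__', None), number=100_000))
print(timeit.timeit(lambda: type(s) is np.float64, number=100_000))
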
@eendebakpt
Contributor Author

With the PRs linked to this issue merged, the difference is smaller. np.mean is a Python function, so some overhead is to be expected.

There are two more items that can be addressed to reduce the difference:

Closing this issue

seberg pushed a commit that referenced this issue May 14, 2022
…E_OVERLAP_OK (#21464)

Addresses one of the items in #21455

* fast check on equivalent arrays in PyArray_EQUIVALENTLY_ITERABLE_OVERLAP_OK

* fix for case stride1==0

* remove assert statements

* whitespace

* DOC: Add comment that output broadcasting is rejected by overlap detection

That is a bit of a weird dynamic here, and it would probably be nice to clean up. However, it is also not a reason to not add the fast path right now.

---

Note that it may also be good to just use a data-pointer comparison or push this fast-path into the overlap detection function itself.
@seberg
Member

seberg commented May 24, 2022

Now that branching happened, I can put you on the trace that our generic objects should probably not have a custom tp_alloc at all ;).
There is a tiny chance that this affects downstream subclasses, but I doubt it, since user dtypes shouldn't really be able to require NEEDS_INIT for anything reasonable.

@eendebakpt
Contributor Author

> Now that branching happened, I can put you on the trace that our generic objects should probably not have a custom tp_alloc at all ;). There is a tiny chance that this affects downstream subclasses, but I doubt it, since user dtypes shouldn't really be able to require NEEDS_INIT for anything reasonable.

@seberg Not sure I understand this.

@seberg
Member

seberg commented May 24, 2022

The np.generic scalar provides a tp_alloc which is inherited by all scalars. This does some unnecessary initialization (only void really needs to initialize).
Removing that tp_alloc() will mean that np.float64, etc. use the default Python allocation directly, which probably removes a tiny bit of overhead.
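
The tp_alloc cost is only visible from Python as part of total scalar-creation time; a rough, machine-dependent probe (not proof that allocation dominates):

import timeit
import numpy as np

print(timeit.timeit(lambda: np.float64(0.0), number=1_000_000))
print(timeit.timeit(lambda: float(0.0), number=1_000_000))  # CPython baseline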

@eendebakpt
Contributor Author

Do you mean we can modify PyGenericArrType_Type.tp_alloc = gentype_alloc here https://github.com/numpy/numpy/blob/main/numpy/core/src/multiarray/scalartypes.c.src#L3874 and set it to _PyObject_New?

The differences between gentype_alloc and _PyObject_New are the allocation of one extra item, a memset(obj, 0, size), and a special case for type->tp_itemsize == 0. Is there any information on why those differences are there?

@seberg

@seberg
Member

seberg commented May 25, 2022

Hmmm, the generic version is PyType_GenericAlloc I think (i.e. what is used if you just delete that line of code).
Seems that does exactly the same stuff, except that it checks for whether it needs storage for the GC.

I thought I had timed it to make things faster if I remove the custom version. Maybe a fluke, or maybe reusing that version is actually good (because it is used all over the place).

EDIT: I suppose in principle we could keep it, but just cut it down a lot (no need for the memset or the itemsize check for our scalars, really).
