BUG: np.mean is slower than np.sum + a division #21455
About the last two points: the problem is that we are dividing by an integer scalar? So a cast is necessary and the promotion isn't quite trivial unfortunately (also because it is a scalar, so the legacy code kicks in, which inspects the value :/; trying to push that out, but it's slow progress). Now, I admit, I think some of those paths that need to fall back got slower when scalars were involved; maybe that is part of what you see now. (I remember there was one path that got slower, while most others got faster, but I forgot which it was.)

The overlap check, it is behind a …

A fun way to improve the …

It is good to look at this closer, but …
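Since the divisor type is the suspect here, one rough way to probe this at the Python level is to time the same division with different divisor types (a sketch of mine, not part of the thread; the numbers will vary by NumPy version):

```python
import timeit
import numpy as np

x = np.random.rand(100)
s = np.sum(x)   # np.float64 scalar, as produced by the sum step of mean
n = x.size      # Python int

# Each divisor type exercises a different cast/promotion path in the divide.
t_pyint   = timeit.timeit(lambda: s / n, number=100_000)
t_intp    = timeit.timeit(lambda: s / np.intp(n), number=100_000)
t_float64 = timeit.timeit(lambda: s / np.float64(n), number=100_000)
print(f"divide by Python int : {t_pyint:.3f} s")
print(f"divide by np.intp    : {t_intp:.3f} s")
print(f"divide by np.float64 : {t_float64:.3f} s")
```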
Interesting, when …
Just to confirm that:

```diff
diff --git a/numpy/core/_methods.py b/numpy/core/_methods.py
index a239e2c87..b6ec15714 100644
--- a/numpy/core/_methods.py
+++ b/numpy/core/_methods.py
@@ -71,7 +71,7 @@ def _count_reduce_items(arr, axis, keepdims=False, where=True):
             axis = tuple(range(arr.ndim))
         elif not isinstance(axis, tuple):
             axis = (axis,)
-        items = nt.intp(1)
+        items = 1
         for ax in axis:
             items *= arr.shape[mu.normalize_axis_index(ax, arr.ndim)]
     else:
```

speeds things up quite a lot in your example. I do not know if scalar math is involved here (or how much), so my scalar-math update may help. But I suspect this also helps for the overrides etc.
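To see where that item count ends up, here is a much simplified sketch (mine, not the actual `numpy/core/_methods.py` code) of the mean path: the count becomes the divisor of the sum, so whether it is a Python `int` or an `np.intp` scalar decides which scalar/promotion machinery the divide goes through.

```python
import numpy as np

def mean_sketch(arr, use_intp_count=False):
    # Simplified stand-in for _count_reduce_items: total number of elements,
    # either as a NumPy intp scalar (old behaviour) or a plain Python int.
    items = np.intp(1) if use_intp_count else 1
    for n in arr.shape:
        items *= n
    # The "sum" step, followed by the divide whose divisor type we just chose.
    ret = np.add.reduce(arr, axis=None)
    return np.true_divide(ret, items)

# Timing mean_sketch(x) against mean_sketch(x, use_intp_count=True) isolates
# the effect of the one-line diff above.
```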
The method `PyUFuncOverride_GetNonDefaultArrayUfunc` is expensive on numpy scalars because these objects do not have a `__array_ufunc__` set, and for a missing attribute lookup CPython generates an exception that is later cleared by numpy. This is a performance bottleneck, see #21455. An issue has been submitted to CPython (python/cpython#92216), but even if this is addressed in CPython, it will take until Python 3.12+ before it is usable by numpy. As an alternative solution, this PR adds a fast path to `PyUFuncOverride_GetNonDefaultArrayUfunc` to determine whether an object is a numpy scalar.
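A rough Python-level illustration of the cost being avoided (my own sketch, not the PR's C fast path): looking up an attribute that does not exist forces CPython to create and then discard an AttributeError internally, which is slower than a successful lookup.

```python
import timeit
import numpy as np

scalar = np.float64(1.0)   # per the PR description, numpy scalars have no __array_ufunc__
array = np.zeros(1)        # ndarray does define __array_ufunc__

# getattr with a default hides the AttributeError from the caller, but CPython
# still raises and clears it internally for the missing attribute.
t_missing = timeit.timeit(lambda: getattr(scalar, "__array_ufunc__", None), number=1_000_000)
t_present = timeit.timeit(lambda: getattr(array, "__array_ufunc__", None), number=1_000_000)
print(f"lookup on np.float64 (missing): {t_missing:.3f} s")
print(f"lookup on ndarray (present):    {t_present:.3f} s")
```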
With the PRs linked to this issue merged, the difference is smaller. There are two more items that can be addressed to reduce the difference:
Closing this issue.
…E_OVERLAP_OK (#21464)

Addresses one of the items in #21455:

* fast check on equivalent arrays in PyArray_EQUIVALENTLY_ITERABLE_OVERLAP_OK
* fix for case stride1==0
* remove assert statements
* whitespace
* DOC: Add comment that output broadcasting is rejected by overlap detection

That is a bit of a weird dynamic here, and it would probably be nice to clean up. However, it is also not a reason to not add the fast path right now.

---

Note that it may also be good to just use a data-pointer comparison or push this fast path into the overlap detection function itself.
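For context, a small Python-level illustration of the overlap question that this check answers (the fast path itself lives in the C iterator code, not here): a freshly allocated ufunc output can never alias the inputs, but a user-supplied `out=` can.

```python
import numpy as np

a = np.arange(10.0)

fresh = np.divide(a, 2.0)            # ufunc allocates the output itself
print(np.shares_memory(a, fresh))    # False: no aliasing is possible

view = a[::-1]                       # another view of the same buffer
np.divide(a, 2.0, out=view)          # user-supplied output that aliases the input
print(np.shares_memory(a, view))     # True: this is what overlap detection guards against
```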
Now that branching happened, I can put you on the trace that our …
@seberg Not sure I understand this.
Do you mean we can modify …

The difference between …
Hmmm, the generic version is … I thought I had timed it and that removing the custom version made things faster. Maybe a fluke, or maybe reusing that version is actually good (because it is used all over the place).

EDIT: I suppose in principle we could not remove it, but just cut it down a lot (no need to memset or …)
Describe the issue:

Using `np.mean` takes longer than `np.sum` and then manually dividing by the number of elements.

Benchmark result of `np.mean(x)` vs. `np.sum(x)/x.size`: (benchmark itself in the reproduce code block below)
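The original benchmark output and reproduce script are not preserved in this excerpt; a minimal comparison along the same lines (my own sketch, not the reporter's code) could look like this:

```python
import timeit
import numpy as np

x = np.random.rand(100)   # small array, so per-call overhead dominates

t_mean = timeit.timeit(lambda: np.mean(x), number=100_000)
t_sum  = timeit.timeit(lambda: np.sum(x) / x.size, number=100_000)
print(f"np.mean(x)       : {t_mean:.3f} s")
print(f"np.sum(x)/x.size : {t_sum:.3f} s")
```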
Analysis

Using valgrind I did some profiling on `np.mean`. Below `ufunc_generic_fastcall` there are 4 large blocks that I analyse below.

1. There is an expensive call to `solve_may_share_memory` (via `ufunc_generic_fastcall -> PyUFunc_GenericFunctionInternal -> try_trivial_single_output_loop -> PyArray_EQUIVALENTLY_ITERABLE_OVERLAP_OK -> solve_may_share_memory`). This is for a ufunc called `divide` (the second step in the `mean` calculation). But the call to `ufunc_generic_fastcall` was without keyword arguments and with `out_is_passed_by_position=0`, so a new output array is allocated for the result and this cannot overlap with the input arrays. If this is really the case, can we pass this knowledge to `try_trivial_single_output_loop` somehow to prevent the `PyArray_EQUIVALENTLY_ITERABLE_OVERLAP_OK` check?
2. `PyUFunc_CheckOverride` is expensive due to the `__array_ufunc__` attribute missing. There are multiple ways of improving this, but there is already a CPython issue (Performance of attribute lookup for type objects, python/cpython#92216) that might help here.
3. Resolving `PyUFunc_TrueDivisionTypeResolver` takes some time. The method is from `ufunc_type_resolution.c`, which is marked as legacy code. Can we avoid getting to this point? It is called from `promote_and_get_info_and_ufuncimpl` (`dispatching.c`). In the code it is noted that "Using promotion failed, this should normally be an error.". What is happening there? All types involved are `float64` I guess, so it should not be hard to find the right ufunc implementation. (A small promotion sketch follows after this list.)
4. Why the `PyArray_CastToType`? It is called from `check_for_trivial_loop` where `must_copy` is `True`. The array is small, so a copy is made.

@seberg
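Regarding point 3, a Python-level look at the promotion the divide actually needs (my own sketch, unrelated to the C type resolver): the sum result is `float64` but the element count is an integer, so the resolver has to handle a mixed float64/integer pair rather than the trivial same-dtype case.

```python
import numpy as np

x = np.random.rand(100)
s = np.sum(x)    # np.float64 scalar
n = x.size       # Python int (an intp scalar inside _count_reduce_items)

print(type(s), type(n))
# The divide in mean therefore resolves a float64 / integer combination:
print(np.result_type(s, n))                  # float64
print(np.result_type(np.float64, np.intp))   # also float64, but not the trivial float64/float64 case
```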
Reproduce the code example:
Error message:
No response
NumPy/Python version information: