PERF: Use python integer on _count_reduce_items #21465


Merged
merged 4 commits into numpy:main from eendebakpt:_count_reduce_items
May 11, 2022

Conversation

eendebakpt
Contributor

@eendebakpt eendebakpt commented May 6, 2022

  • The calculation in _count_reduce_items is faster with a plain python int than with a numpy scalar np.intp. The elements of arr.shape are plain integers.
  • The usage of the result (the division in np.mean) also might be faster with int instead of np.intp (although in the long run it might be better to make the division with np.intp just as fast as with int).
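The first point can be illustrated in isolation (a rough sketch, not the PR's benchmark: it times only the shape product from _count_reduce_items; absolute numbers depend on the machine and NumPy version):

```python
import timeit

import numpy as np

# The product of shape entries, computed with a Python int accumulator
# vs. an np.intp scalar accumulator.  Each np.intp multiplication goes
# through NumPy's scalar machinery, which adds per-operation overhead.
shape = (10, 20)

def count_py():
    items = 1
    for n in shape:
        items *= n
    return items

def count_intp():
    items = np.intp(1)
    for n in shape:
        items *= n
    return items

t_py = timeit.timeit(count_py, number=100_000)
t_intp = timeit.timeit(count_intp, number=100_000)
print(f"python int: {t_py:.3f}s  np.intp: {t_intp:.3f}s")
```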

Microbenchmark

import numpy as np
import time

x = np.random.rand(10, 20)
niter = 200_000

t0 = time.perf_counter()
for ii in range(niter):
    y = np.mean(x, axis=1)
dt = time.perf_counter() - t0
print(dt)

Results in:

main: 1.78
PR: 1.39

Addresses one of the items in #21455

@eendebakpt eendebakpt marked this pull request as draft May 6, 2022 11:56
@@ -71,7 +71,7 @@ def _count_reduce_items(arr, axis, keepdims=False, where=True):
             axis = tuple(range(arr.ndim))
         elif not isinstance(axis, tuple):
             axis = (axis,)
-        items = nt.intp(1)
+        items = 1
         for ax in axis:
             items *= arr.shape[mu.normalize_axis_index(ax, arr.ndim)]
Member


I am half curious now what would happen if we use items = nt.intp(items) at the very end. Adds the overhead of creating the scalar, but besides the unfortunate __array_ufunc__ overhead, it might actually be faster and also a bit cleaner.

Member


IIRC, that code was tricky to get right for all the corner cases. That type conversion was probably there for a reason.

Member


Yeah, seems later code expects a NumPy object (with .ndim attribute) in some branches. gh-20175 might help a bit here in any case (which would be merged soon).

Otherwise, would have to check if it is still worthwhile to convert later, or the using code can be refactored easily. It may well make sense to convert to np.array(items, dtype=np.intp), here actually!

Contributor Author


This one is tricky to get right. We seem to agree the calculation inside _count_reduce_items can be done in plain int, but the output should be in nt.intp or nt.ndarray(..., dtype=nt.intp). The reason is that if the where argument is not True the output of _count_reduce_items is not a scalar but an array.
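The where corner case can be seen from Python: with a boolean mask the number of reduced items differs per output element, so the count cannot be a single scalar (a small sketch using the public np.mean interface rather than the private helper):

```python
import numpy as np

x = np.arange(6.0).reshape(2, 3)
mask = np.array([[True, False, True],
                 [True, True, True]])

# Row 0 averages 2 elements, row 1 averages 3: the item count is per-row,
# so internally the count must be an array, not a scalar.
print(np.mean(x, axis=1, where=mask))  # [1. 4.]
```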

Some timings for the calculation (the third one is the suggestion by seberg)

main: 1.78
this PR: 1.37
this PR with cast to nt.intp: 1.70
this PR with final cast to nt.array(..., dtype=nt.intp): 1.74

So unfortunately, the main performance win is because the true_divide in _mean is not very efficient for NumPy scalar inputs, while it is for Python int input. Benchmark to confirm:

import numpy as np
from numpy.core import umath as um

x = np.random.rand(10,)

v = 5
%timeit um.true_divide(x, v)
v = np.intp(5)
%timeit um.true_divide(x, v)
v = np.array([5], dtype=np.intp)
%timeit um.true_divide(x, v)

results in

1.24 µs ± 9.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.6 µs ± 6.57 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.5 µs ± 13.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
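The %timeit lines above need IPython; a plain-Python equivalent using the stdlib timeit module and the public np.true_divide (a sketch, absolute timings vary by machine and NumPy version):

```python
import timeit

import numpy as np

x = np.random.rand(10)

# Compare a Python int divisor against an np.intp scalar and a
# one-element intp array, mirroring the three %timeit cases above.
for v in (5, np.intp(5), np.array([5], dtype=np.intp)):
    t = timeit.timeit(lambda: np.true_divide(x, v), number=100_000)
    print(f"{type(v).__name__:>8}: {t:.3f}s")
```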

Casting the result to nt.intp and casting back to int in the case where != True and rcount.ndim == 0 is not a solution. This code:

        items = 1
        for ax in axis:
            items *= arr.shape[mu.normalize_axis_index(ax, arr.ndim)]
        items = nt.intp(items)
        items = int(items)
...

(which does not even handle the where case!) already loses performance.

The best solution would be to make the true_divide also efficient for np.intp input, but that is not the scope of this PR.

I updated the PR to keep the output of _count_reduce_items an int, but handle the corner cases and make the tests pass. The performance gain is still there, and the code is not much worse, so it might be worthwhile to merge. The alternative is to put this PR on hold and see first whether we can improve the performance with other means.

One of the corner cases is input of an array with dtype=object. If rcount is not cast to nt.intp, we end up with a pure Python integer division, which results in a plain float instead of a NumPy scalar.

import numpy as np

rcount = 2

mat = np.array([5])
x = np.add.reduce(mat, axis=None)
print(x / rcount, type(x / rcount))

mat = np.array([5], dtype=object)
x = np.add.reduce(mat, axis=None)
print(x / rcount, type(x / rcount))

Output:

2.5 <class 'numpy.float64'>
2.5 <class 'float'>
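Casting the count to a NumPy type restores the NumPy scalar result in that corner case (a sketch; here the plain int 5 stands in for the reduced value of an object-dtype array):

```python
import numpy as np

x = 5  # reduced value of an object-dtype array is a plain Python int

# Dividing int by int gives a plain float; dividing by an np.intp
# count goes through NumPy and gives a np.float64 scalar.
print(type(x / 2))           # <class 'float'>
print(type(x / np.intp(2)))  # <class 'numpy.float64'>
```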

Member


@eendebakpt you may want to apply gh-21188 before spending too much time on certain things. If np.float64(3.) / 3 is much faster than np.float64(3.) / intp(3), then that PR should just fix that part at least.
(The PR only affects scalar code paths though.)

Contributor Author


@seberg With recent PRs the performance gap is less, so no need for a special case with int any more. There is still a performance gap, so if you could enable the fast paths you mentioned or perform the "fixing" that would be great.

Member


I was thinking of this diff:

diff --git a/numpy/core/src/multiarray/convert_datatype.c b/numpy/core/src/multiarray/convert_datatype.c
index 7fd436a9d..f340ff077 100644
--- a/numpy/core/src/multiarray/convert_datatype.c
+++ b/numpy/core/src/multiarray/convert_datatype.c
@@ -497,6 +497,20 @@ PyArray_CheckCastSafety(NPY_CASTING casting,
     if (to != NULL) {
         to_dtype = NPY_DTYPE(to);
     }
+
+    /*
+     * Fast path for builtin NumPy dtypes.  This may not be worthwhile overall,
+     * but in some cases the legacy ufunc loop kicks in and cannot be cached
+     * (this happens e.g. for `arr / scalar`) and then we might check a couple
+     * of casts to search the right loop.
+     */
+    if ((unsigned int)(from->type_num) <= NPY_CLONGDOUBLE
+            && (unsigned int)(to_dtype->type_num) <= NPY_CLONGDOUBLE
+            && PyArray_MinCastSafety(casting, NPY_SAFE_CASTING) == casting
+            && _npy_can_cast_safely_table[from->type_num][to_dtype->type_num]) {
+        return 1;
+    }
+
     PyObject *meth = PyArray_GetCastingImpl(NPY_DTYPE(from), to_dtype);
     if (meth == NULL) {
         return -1;

because I thought a good deal of time in the type resolver ended up being spent here. But it doesn't seem to make a huge difference now, so I am not sure it is actually worth it right now (have to check more).

Member

@seberg seberg May 8, 2022


OK, that diff is wrong, because the important part is the one where 0 is returned (the path can be adapted for the interesting case by checking casting == NPY_SAFE_CASTING and then returning the table value). But I am still not really seeing any worthwhile speedup; maybe I am just testing the wrong thing tonight.

EDIT: Fixed diff (not that it matters much; even for arr / 0darr or arr + 0darr I don't really see it making a big difference):

    if ((unsigned int)(from->type_num) <= NPY_CLONGDOUBLE
            && (unsigned int)(to_dtype->type_num) <= NPY_CLONGDOUBLE) {
        npy_bool safe;
        safe = _npy_can_cast_safely_table[from->type_num][to_dtype->type_num];
        if (casting == NPY_SAFE_CASTING) {
            return safe;
        }
        if (safe && PyArray_MinCastSafety(casting, NPY_SAFE_CASTING) == casting) {
            return 1;
        }
    }

Contributor Author


@seberg I found the path that triggers the legacy type resolution, for both np.array([1., 2.])/10 and np.array([1., 2.])/np.intp(10).

  1. The division ends up in ufunc_generic_fastcall which calls convert_ufunc_arguments
  2. convert_ufunc_arguments is called with nin=2, nout=1 and allow_legacy_promotion=1. First argument is a float64 array, second a float64 scalar.
  3. In convert_ufunc_arguments this results in all_scalar=0, any_scalar=1.
  4. Then this piece of code
        /*
         * TODO: we need to special case scalars here, if the input is a
         *       Python int, float, or complex, we have to use the "weak"
         *       DTypes: `PyArray_PyIntAbstractDType`, etc.
         *       This is to allow e.g. `float32(1.) + 1` to return `float32`.
         *       The correct array dtype can only be found after promotion for
         *       such a "weak scalar".  We could avoid conversion here, but
         *       must convert it for use in the legacy promotion.
         *       There is still a small chance that this logic can instead
         *       happen inside the Python operators.
         */
    }
    if (*allow_legacy_promotion && (!all_scalar && any_scalar)) {
        *force_legacy_promotion = should_use_min_scalar(nin, out_op, 0, NULL);
    }

sets force_legacy_promotion to one.

Is this something we can improve?
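The mixed array/scalar situation described in the steps above (all_scalar=0, any_scalar=1) is exactly the case where value-based "min scalar" promotion kicks in; its user-visible effect can be sketched like this:

```python
import numpy as np

# A mixed array/scalar operation: one array operand plus one scalar
# operand, the situation that sets force_legacy_promotion above.
arr32 = np.array([1.0, 2.0], dtype=np.float32)
res = arr32 + 2.0

# The Python float scalar does not upcast the float32 array to float64:
# the scalar's dtype is decided relative to the array operand.
print(res.dtype)  # float32
```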

Member


Yes, but I wrote a 2800 line file on that general problem ;).
More seriously: on the path towards the above/TODO, there may be a middle ground that just makes the "obvious" paths fast (this one happens to be obvious).

I.e. once the TODO is half gone, force_legacy_promotion could probably be removed as long as we take some care later (and make the paths where it has to kick in fast enough!). That may fall out, but maybe it would just add complexity. So overall, I wouldn't really aim for speeding things up, except by saying: sure hopefully that NEP will be implemented soon enough, and then we can make sure things are fast.

@eendebakpt eendebakpt force-pushed the _count_reduce_items branch 4 times, most recently from bac0b0f to 2090c8a Compare May 8, 2022 20:24
@eendebakpt eendebakpt marked this pull request as ready for review May 8, 2022 20:27
@eendebakpt eendebakpt force-pushed the _count_reduce_items branch from 2090c8a to 1ca74a3 Compare May 11, 2022 08:16
@seberg seberg changed the title PERF Use python integer on _count_reduce_items PERF: Use python integer on _count_reduce_items May 11, 2022
@seberg
Member

seberg commented May 11, 2022

Happy to put this in; just to confirm, do you think it is still worthwhile?

@eendebakpt
Contributor Author

@seberg To be fair: it does not make a large performance difference any more.

But the items in arr.shape are plain python int, so this seems cleaner to me.

@seberg
Member

seberg commented May 11, 2022

OK, I will put this in (after tests pass). But if someone does the opposite PR, I will probably put that in too ;).

@seberg seberg merged commit 42dc653 into numpy:main May 11, 2022
@eendebakpt eendebakpt deleted the _count_reduce_items branch May 11, 2022 10:14