PERF: Use python integer on _count_reduce_items #21465
Merged
I am half curious now what would happen if we use `items = nt.intp(items)` at the very end. It adds the overhead of creating the scalar, but besides the unfortunate `__array_ufunc__` overhead, it might actually be faster and also a bit cleaner.
IIRC, that code was tricky to get right for all the corner cases. That type conversion was probably there for a reason.
Yeah, it seems later code expects a NumPy object (with a `.ndim` attribute) in some branches. gh-20175 might help a bit here in any case (it should be merged soon). Otherwise, we would have to check whether it is still worthwhile to convert later, or whether the using code can be refactored easily. It may well make sense to convert to `np.array(items, dtype=np.intp)` here, actually!
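A small sketch (my own, not code from the PR) of why downstream branches that check `.ndim` care about this conversion: a plain Python `int` has no `.ndim`, while the 0-d array produced by `np.array(items, dtype=np.intp)` does.

```python
import numpy as np

items = 3                                  # plain Python int
as_intp = np.array(items, dtype=np.intp)   # 0-d NumPy array

# Later branches reportedly expect a NumPy object exposing .ndim;
# a plain int has no such attribute.
print(hasattr(items, "ndim"))   # False
print(as_intp.ndim)             # 0
```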
This one is tricky to get right. We seem to agree the calculation inside `_count_reduce_items` can be done in plain `int`, but the output should be an `nt.intp` or an `nt.ndarray(..., dtype=nt.intp)`. The reason is that if the `where` argument is not `True`, the output of `_count_reduce_items` is not a scalar but an array.

Some timings for the calculation (the third one is the suggestion by seberg):

So unfortunately, the main performance win is because the `true_divide` in `_mean` is not very efficient for NumPy scalar type inputs, while it is for `int` type input. Benchmark to confirm:

results in
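The original benchmark snippet and its output were not captured here. A minimal sketch of such a timing comparison (the array size and repeat count are illustrative choices of mine, not from the PR) might look like:

```python
import timeit
import numpy as np

a = np.arange(100.0)

# Compare dividing by a plain Python int with dividing by a NumPy intp
# scalar; the PR discussion suggests the intp path was noticeably slower
# in the NumPy version under discussion.
t_int = timeit.timeit(lambda: a / 10, number=10_000)
t_intp = timeit.timeit(lambda: a / np.intp(10), number=10_000)
print(f"int divisor:  {t_int:.4f}s")
print(f"intp divisor: {t_intp:.4f}s")
```

Both divisions produce the same values; only the scalar type of the divisor differs.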
Casting the result to `nt.intp` and casting back to `int` in the case `where != True` and `rcount.ndim == 0` is not a solution. This code:

(which does not even handle the `where` case!) already loses performance.

The best solution would be to make `true_divide` efficient for `np.intp` input as well, but that is outside the scope of this PR.

I updated the PR to keep the output of `_count_reduce_items` an `int`, but handle the corner cases and make the tests pass. The performance gain is still there, and the code is not much worse, so it might be worthwhile to merge. The alternative is to put this PR on hold and first see whether we can improve the performance by other means.

One of the corner cases is input of an array with `dtype=object`. If `rcount` is not cast to `nt.intp`, we end up with a pure integer division, which results in a scalar and not an array.

Output:
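As a hedged illustration of the object-dtype corner case (my own sketch, not the PR's snippet): summing an object array yields a plain Python scalar, and the type of the division result then depends on whether the divisor is a Python `int` or an `np.intp`.

```python
import numpy as np

a = np.array([1, 2, 3], dtype=object)
s = a.sum()  # for dtype=object this is a plain Python int

# Dividing by a plain int stays in pure Python and yields a Python float;
# dividing by an intp scalar goes through NumPy and yields np.float64.
print(type(s / 3))           # <class 'float'>
print(type(s / np.intp(3)))  # <class 'numpy.float64'>
```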
@eendebakpt you may want to apply gh-21188 before spending too much time on certain things. If `np.float64(3.) / 3` is much faster than `np.float64(3.) / np.intp(3)`, then that PR should fix at least that part. (The PR only affects scalar code paths, though.)
@seberg With recent PRs the performance gap is smaller, so there is no need for a special case with `int` any more. There is still a performance gap, though, so if you could enable the fast paths you mentioned or perform the "fixing", that would be great.
I was thinking of this diff:

because I thought a good deal of time in the type resolver ended up being spent here. But it doesn't seem to make a huge difference now, so I am not sure it is actually worth it right now (I have to check more).
OK, that diff is wrong, because the important part is the one where `0` is returned (the path can be adapted for the interesting case by checking `casting == NPY_SAFE_CASTING` and then returning the table value). But I am still not really seeing any worthwhile speedup; maybe I am just testing the wrong thing tonight.

EDIT: Fixed the diff (not that it matters much; even for `arr / 0darr` or `arr + 0darr` I don't really see it making a big difference):
@seberg I found the path that triggers the legacy type resolution, for both `np.array([1., 2.]) / 10` and `np.array([1., 2.]) / np.intp(10)`:

- `ufunc_generic_fastcall` calls `convert_ufunc_arguments`.
- `convert_ufunc_arguments` is called with `nin=2`, `nout=1` and `allow_legacy_promotion=1`. The first argument is a float64 array, the second a float64 scalar.
- In `convert_ufunc_arguments` this results in `all_scalar=0, any_scalar=1`, which sets `force_legacy_promotion` to one.

Is this something we can improve?
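A rough illustration (my own sketch, not code from the PR) of the value-based behavior that the scalar/array mix has to preserve: deciding the result dtype for an array combined with a Python scalar depends on the scalar itself, which is why such calls are routed to the legacy promotion path.

```python
import numpy as np

a = np.array([1, 2], dtype=np.int8)

# A Python int scalar does not upcast the array: the result stays int8.
# Determining this requires looking at the scalar operand, so a
# scalar/array mix cannot use the plain dtype-only promotion table.
print((a + 1).dtype)  # int8
```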
Yes, but I wrote a 2800-line file on that general problem ;).

More seriously: on the path towards the above TODO, there may be a middle ground that just makes the "obvious" paths fast (this one happens to be obvious). I.e. once the TODO is half gone, `force_legacy_promotion` could probably be removed, as long as we take some care later (and make the paths where it has to kick in fast enough!). That may fall out naturally, but maybe it would just add complexity. So overall, I wouldn't really aim for speeding things up, except by saying: hopefully that NEP will be implemented soon enough, and then we can make sure things are fast.