PERF: Use python integer on _count_reduce_items #21465


Merged
merged 4 commits into numpy:main from eendebakpt:_count_reduce_items
May 11, 2022

Conversation

eendebakpt
Contributor

@eendebakpt eendebakpt commented May 6, 2022

  • The calculation in _count_reduce_items is faster with a plain python int than with a numpy scalar np.intp. The elements of arr.shape are plain integers.
  • The usage of the result (the division in np.mean) also might be faster with int instead of np.intp (although in the long run it might be better to make the division with np.intp just as fast as with int).
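The first point can be illustrated in isolation (a rough sketch, not the PR's benchmark: it times only the shape product from _count_reduce_items; absolute numbers depend on the machine and NumPy version):

```python
import timeit

import numpy as np

# The product of shape entries, computed with a Python int accumulator
# vs. an np.intp scalar accumulator.  Each np.intp multiplication goes
# through NumPy's scalar machinery, which adds per-operation overhead.
shape = (10, 20)

def count_py():
    items = 1
    for n in shape:
        items *= n
    return items

def count_intp():
    items = np.intp(1)
    for n in shape:
        items *= n
    return items

t_py = timeit.timeit(count_py, number=100_000)
t_intp = timeit.timeit(count_intp, number=100_000)
print(f"python int: {t_py:.3f}s  np.intp: {t_intp:.3f}s")
```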

Microbenchmark

import numpy as np
import time

x = np.random.rand(10, 20)
niter = 200_000

t0 = time.perf_counter()
for ii in range(niter):
    y = np.mean(x, axis=1)
dt = time.perf_counter() - t0
print(dt)

Results in:

main: 1.78
PR: 1.39

Addresses one of the items in #21455

@eendebakpt eendebakpt marked this pull request as draft May 6, 2022 11:56
@@ -71,7 +71,7 @@ def _count_reduce_items(arr, axis, keepdims=False, where=True):
             axis = tuple(range(arr.ndim))
         elif not isinstance(axis, tuple):
             axis = (axis,)
-        items = nt.intp(1)
+        items = 1
         for ax in axis:
             items *= arr.shape[mu.normalize_axis_index(ax, arr.ndim)]
Member


I am half curious now what would happen if we use items = nt.intp(items) at the very end. Adds the overhead of creating the scalar, but besides the unfortunate __array_ufunc__ overhead, it might actually be faster and also a bit cleaner.

Member


IIRC, that code was tricky to get right for all the corner cases. That type conversion was probably there for a reason.

Member


Yeah, seems later code expects a NumPy object (with .ndim attribute) in some branches. gh-20175 might help a bit here in any case (which would be merged soon).

Otherwise, would have to check if it is still worthwhile to convert later, or the using code can be refactored easily. It may well make sense to convert to np.array(items, dtype=np.intp), here actually!

Contributor Author


This one is tricky to get right. We seem to agree the calculation inside _count_reduce_items can be done in plain int, but the output should be in nt.intp or nt.ndarray(..., dtype=nt.intp). The reason is that if the where argument is not True the output of _count_reduce_items is not a scalar but an array.
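The where corner case can be seen from Python: with a boolean mask the number of reduced items differs per output element, so the count cannot be a single scalar (a small sketch using the public np.mean interface rather than the private helper):

```python
import numpy as np

x = np.arange(6.0).reshape(2, 3)
mask = np.array([[True, False, True],
                 [True, True, True]])

# Row 0 averages 2 elements, row 1 averages 3: the item count is per-row,
# so internally the count must be an array, not a scalar.
print(np.mean(x, axis=1, where=mask))  # [1. 4.]
```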

Some timings for the calculation (the third one is the suggestion by seberg)

main: 1.78
this PR: 1.37
this PR with cast to nt.intp: 1.70
this PR with final cast to nt.array(..., dtype=nt.intp): 1.74

So unfortunately, the main performance win is because the true_divide in _mean is not very efficient for NumPy scalar inputs, while it is for Python int input. Benchmark to confirm:

import numpy as np
from numpy.core import umath as um

x = np.random.rand(10,)

v = 5
%timeit um.true_divide(x, v)
v = np.intp(5)
%timeit um.true_divide(x, v)
v = np.array([5], dtype=np.intp)
%timeit um.true_divide(x, v)

results in

1.24 µs ± 9.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.6 µs ± 6.57 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.5 µs ± 13.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
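The %timeit lines above need IPython; a plain-Python equivalent using the stdlib timeit module and the public np.true_divide (a sketch, absolute timings vary by machine and NumPy version):

```python
import timeit

import numpy as np

x = np.random.rand(10)

# Compare a Python int divisor against an np.intp scalar and a
# one-element intp array, mirroring the three %timeit cases above.
for v in (5, np.intp(5), np.array([5], dtype=np.intp)):
    t = timeit.timeit(lambda: np.true_divide(x, v), number=100_000)
    print(f"{type(v).__name__:>8}: {t:.3f}s")
```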

Casting the result to nt.intp and casting back to int in the case where != True and rcount.ndim == 0 is not a solution. This code:

        items = 1
        for ax in axis:
            items *= arr.shape[mu.normalize_axis_index(ax, arr.ndim)]
        items = nt.intp(items)
        items = int(items)
...

(which does not even handle the where case!) already loses performance.

The best solution would be to make the true_divide also efficient for np.intp input, but that is not the scope of this PR.

I updated the PR to keep the output of _count_reduce_items an int, but handle the corner cases and make the tests pass. The performance gain is still there, and the code is not much worse, so it might be worthwhile to merge. The alternative is to put this PR on hold and see first whether we can improve the performance with other means.

One of the corner cases is input of an array with dtype=object. If rcount is not cast to nt.intp, we end up with a pure Python integer division, which results in a plain float instead of a NumPy scalar.

import numpy as np

rcount = 2

mat = np.array([5])
x = np.add.reduce(mat, axis=None)
print(x / rcount, type(x / rcount))

mat = np.array([5], dtype=object)
x = np.add.reduce(mat, axis=None)
print(x / rcount, type(x / rcount))

Output:

2.5 <class 'numpy.float64'>
2.5 <class 'float'>
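Casting the count to a NumPy type restores the NumPy scalar result in that corner case (a sketch; here the plain int 5 stands in for the reduced value of an object-dtype array):

```python
import numpy as np

x = 5  # reduced value of an object-dtype array is a plain Python int

# Dividing int by int gives a plain float; dividing by an np.intp
# count goes through NumPy and gives a np.float64 scalar.
print(type(x / 2))           # <class 'float'>
print(type(x / np.intp(2)))  # <class 'numpy.float64'>
```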

Member


@eendebakpt you may want to apply gh-21188 before spending too much time on certain things. If np.float64(3.) / 3 is much faster than np.float64(3.) / intp(3), then that PR should just fix that part at least.
(The PR only affects scalar code paths though.)

Contributor Author


@seberg With recent PRs the performance gap is less, so no need for a special case with int any more. There is still a performance gap, so if you could enable the fast paths you mentioned or perform the "fixing" that would be great.

Member


I was thinking of this diff:

diff --git a/numpy/core/src/multiarray/convert_datatype.c b/numpy/core/src/multiarray/convert_datatype.c
index 7fd436a9d..f340ff077 100644
--- a/numpy/core/src/multiarray/convert_datatype.c
+++ b/numpy/core/src/multiarray/convert_datatype.c
@@ -497,6 +497,20 @@ PyArray_CheckCastSafety(NPY_CASTING casting,
     if (to != NULL) {
         to_dtype = NPY_DTYPE(to);
     }
+
+    /*
+     * Fast path for builtin NumPy dtypes.  This may not be worthwhile overall,
+     * but in some cases the legacy ufunc loop kicks in and cannot be cached
+     * (this happens e.g. for `arr / scalar`) and then we might check a couple
+     * of casts to search the right loop.
+     */
+    if ((unsigned int)(from->type_num) <= NPY_CLONGDOUBLE
+            && (unsigned int)(to_dtype->type_num) <= NPY_CLONGDOUBLE
+            && PyArray_MinCastSafety(casting, NPY_SAFE_CASTING) == casting
+            && _npy_can_cast_safely_table[from->type_num][to_dtype->type_num]) {
+        return 1;
+    }
+
     PyObject *meth = PyArray_GetCastingImpl(NPY_DTYPE(from), to_dtype);
     if (meth == NULL) {
         return -1;

because I thought a good deal of time in the type resolver ended up being spent here. But it doesn't seem to make a huge difference now, so I am not sure it is actually worth it right now (have to check more).

Member

@seberg seberg May 8, 2022


OK, that diff is wrong, because the important part is the one where 0 is returned (the path can be adapted for the interesting case by checking casting == NPY_SAFE_CASTING and then returning the table value). But I am still not really seeing any worthwhile speedup; maybe I am just testing the wrong thing tonight.

EDIT: Fixed diff (not that it matters much; even for arr / 0darr or arr + 0darr I don't really see it making a big difference):

    if ((unsigned int)(from->type_num) <= NPY_CLONGDOUBLE
            && (unsigned int)(to_dtype->type_num) <= NPY_CLONGDOUBLE) {
        npy_bool safe;
        safe = _npy_can_cast_safely_table[from->type_num][to_dtype->type_num];
        if (casting == NPY_SAFE_CASTING) {
            return safe;
        }
        if (safe && PyArray_MinCastSafety(casting, NPY_SAFE_CASTING) == casting) {
            return 1;
        }
    }

Contributor Author


@seberg I found the path that triggers the legacy type resolution, for both np.array([1., 2.])/10 and np.array([1., 2.])/np.intp(10).

  1. The division ends up in ufunc_generic_fastcall which calls convert_ufunc_arguments
  2. convert_ufunc_arguments is called with nin=2, nout=1 and allow_legacy_promotion=1. First argument is a float64 array, second a float64 scalar.
  3. In convert_ufunc_arguments this results in all_scalar=0, any_scalar=1.
  4. Then this piece of code
        /*
         * TODO: we need to special case scalars here, if the input is a
         *       Python int, float, or complex, we have to use the "weak"
         *       DTypes: `PyArray_PyIntAbstractDType`, etc.
         *       This is to allow e.g. `float32(1.) + 1` to return `float32`.
         *       The correct array dtype can only be found after promotion for
         *       such a "weak scalar".  We could avoid conversion here, but
         *       must convert it for use in the legacy promotion.
         *       There is still a small chance that this logic can instead
         *       happen inside the Python operators.
         */
    }
    if (*allow_legacy_promotion && (!all_scalar && any_scalar)) {
        *force_legacy_promotion = should_use_min_scalar(nin, out_op, 0, NULL);
    }

sets force_legacy_promotion to one.

Is this something we can improve?
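The mixed array/scalar situation described in the steps above (all_scalar=0, any_scalar=1) is exactly the case where value-based "min scalar" promotion kicks in; its user-visible effect can be sketched like this:

```python
import numpy as np

# A mixed array/scalar operation: one array operand plus one scalar
# operand, the situation that sets force_legacy_promotion above.
arr32 = np.array([1.0, 2.0], dtype=np.float32)
res = arr32 + 2.0

# The Python float scalar does not upcast the float32 array to float64:
# the scalar's dtype is decided relative to the array operand.
print(res.dtype)  # float32
```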

Member


Yes, but I wrote a 2800 line file on that general problem ;).
More seriously: on the path towards the above/TODO, there may be a middle ground that just makes the "obvious" paths fast (this one happens to be obvious).

I.e. once the TODO is half gone, force_legacy_promotion could probably be removed as long as we take some care later (and make the paths where it has to kick in fast enough!). That may fall out, but maybe it would just add complexity. So overall, I wouldn't really aim for speeding things up, except by saying: sure hopefully that NEP will be implemented soon enough, and then we can make sure things are fast.

@eendebakpt eendebakpt force-pushed the _count_reduce_items branch 4 times, most recently from bac0b0f to 2090c8a Compare May 8, 2022 20:24
@eendebakpt eendebakpt marked this pull request as ready for review May 8, 2022 20:27
@eendebakpt eendebakpt force-pushed the _count_reduce_items branch from 2090c8a to 1ca74a3 Compare May 11, 2022 08:16
@seberg seberg changed the title PERF Use python integer on _count_reduce_items PERF: Use python integer on _count_reduce_items May 11, 2022
@seberg
Member

seberg commented May 11, 2022

Happy to put this in; just to confirm, do you think it is still worthwhile?

@eendebakpt
Contributor Author

@seberg To be fair: it does not make a large performance difference any more.

But the items in arr.shape are plain python int, so this seems cleaner to me.

@seberg
Member

seberg commented May 11, 2022

OK, I will put this in (after tests pass). But if someone does the opposite PR, I will probably put that in too ;).

@seberg seberg merged commit 42dc653 into numpy:main May 11, 2022
@eendebakpt eendebakpt deleted the _count_reduce_items branch May 11, 2022 10:14