PERF: Use python integer on _count_reduce_items #21465

Conversation
@@ -71,7 +71,7 @@ def _count_reduce_items(arr, axis, keepdims=False, where=True):
         axis = tuple(range(arr.ndim))
     elif not isinstance(axis, tuple):
         axis = (axis,)
-    items = nt.intp(1)
+    items = 1
     for ax in axis:
         items *= arr.shape[mu.normalize_axis_index(ax, arr.ndim)]
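To see why the one-line change matters: multiplying a NumPy integer scalar goes through the full ufunc dispatch machinery on every `*=`, while a plain Python `int` does not. A minimal sketch of the difference (`np.intp` here stands in for the internal `nt.intp` alias):

```python
import numpy as np
from timeit import timeit

shape = (3, 4, 5)  # stand-in for arr.shape

def count_plain():
    items = 1            # plain Python int accumulator
    for n in shape:
        items *= n
    return items

def count_intp():
    items = np.intp(1)   # NumPy scalar: each *= dispatches through ufunc machinery
    for n in shape:
        items *= n
    return items

assert count_plain() == count_intp() == 60
print(timeit(count_plain, number=100_000))  # typically noticeably faster
print(timeit(count_intp, number=100_000))
```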
I am half curious now what would happen if we use `items = nt.intp(items)` at the very end. Adds the overhead of creating the scalar, but besides the unfortunate `__array_ufunc__` overhead, it might actually be faster and also a bit cleaner.
IIRC, that code was tricky to get right for all the corner cases. That type conversion was probably there for a reason.
Yeah, it seems later code expects a NumPy object (with a `.ndim` attribute) in some branches. gh-20175 might help a bit here in any case (which should be merged soon).
Otherwise, we would have to check whether it is still worthwhile to convert later, or whether the calling code can be refactored easily. It may well make sense to convert to `np.array(items, dtype=np.intp)` here, actually!
This one is tricky to get right. We seem to agree the calculation inside `_count_reduce_items` can be done in plain `int`, but the output should be `nt.intp` or `nt.ndarray(..., dtype=nt.intp)`. The reason is that if the `where` argument is not `True`, the output of `_count_reduce_items` is not a scalar but an array.
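To illustrate that corner case: with a `where` mask each output element can average a different number of items, so the count has to be an array rather than a scalar. A small example using the public `where=` keyword of `np.mean`:

```python
import numpy as np

a = np.arange(6.).reshape(2, 3)          # [[0, 1, 2], [3, 4, 5]]
mask = np.array([[True, False, True],
                 [True, True,  True]])

# Row 0 averages 2 items ((0 + 2) / 2), row 1 averages 3 ((3 + 4 + 5) / 3):
# the per-row counts differ, so internally the item count must be an array.
print(np.mean(a, axis=1, where=mask))    # [1. 4.]
```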
Some timings for the calculation (the third one is the suggestion by seberg):

main: 1.78
this PR: 1.37
this PR with cast to nt.intp: 1.70
this PR with final cast to nt.array(..., dtype=nt.intp): 1.74
So unfortunately, the main performance win is because the `true_divide` in `_mean` is not very efficient for NumPy-typed input, while it is for `int` input. Benchmark to confirm:
import numpy as np
from numpy.core import umath as um

x = np.random.rand(10,)

v = 5
%timeit um.true_divide(x, v)
v = np.intp(5)
%timeit um.true_divide(x, v)
v = np.array([5], dtype=np.intp)
%timeit um.true_divide(x, v)
results in
1.24 µs ± 9.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.6 µs ± 6.57 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.5 µs ± 13.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Casting the result to `nt.intp` and casting back to `int` in the case `where != True` and `rcount.ndim == 0` is not a solution. This code:
items = 1
for ax in axis:
    items *= arr.shape[mu.normalize_axis_index(ax, arr.ndim)]
items = nt.intp(items)
items = int(items)
...
(which does not even handle the `where` case!) already loses performance.
The best solution would be to make `true_divide` just as efficient for `np.intp` input, but that is outside the scope of this PR.
I updated the PR to keep the output of `_count_reduce_items` an `int`, but handle the corner cases and make the tests pass. The performance gain is still there, and the code is not much worse, so it might be worthwhile to merge. The alternative is to put this PR on hold and see first whether we can improve the performance with other means.
One of the corner cases is input of an array with `dtype=object`. If `rcount` is not cast to `nt.intp`, we end up with a pure Python division, which results in a plain Python `float` and not a NumPy scalar.
import numpy as np

rcount = 2

mat = np.array([5])
x = np.add.reduce(mat, axis=None)
print(x / rcount, type(x / rcount))

mat = np.array([5], dtype=object)
x = np.add.reduce(mat, axis=None)
print(x / rcount, type(x / rcount))
Output:
2.5 <class 'numpy.float64'>
2.5 <class 'float'>
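Conversely, casting the count to `np.intp` pulls the result back into NumPy types even when the object-dtype reduction returned a plain Python int (a minimal sketch):

```python
import numpy as np

x = np.add.reduce(np.array([5], dtype=object), axis=None)
print(type(x))               # <class 'int'> -- object reduction yields a Python int

print(type(x / 2))           # <class 'float'> -- pure Python division
print(type(x / np.intp(2)))  # <class 'numpy.float64'> -- back to a NumPy scalar
```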
@eendebakpt you may want to apply gh-21188 before spending too much time on certain things. If `np.float64(3.) / 3` is much faster than `np.float64(3.) / intp(3)`, then that PR should at least fix that part.
(The PR only affects scalar code paths though.)
@seberg With the recent PRs the performance gap is smaller, so there is no need for a special case with `int` any more. There is still a performance gap, though, so if you could enable the fast paths you mentioned or perform the "fixing", that would be great.
I was thinking of this diff:
diff --git a/numpy/core/src/multiarray/convert_datatype.c b/numpy/core/src/multiarray/convert_datatype.c
index 7fd436a9d..f340ff077 100644
--- a/numpy/core/src/multiarray/convert_datatype.c
+++ b/numpy/core/src/multiarray/convert_datatype.c
@@ -497,6 +497,20 @@ PyArray_CheckCastSafety(NPY_CASTING casting,
if (to != NULL) {
to_dtype = NPY_DTYPE(to);
}
+
+ /*
+ * Fast path for builtin NumPy dtypes. This may not be worthwhile overall,
+ * but in some cases the legacy ufunc loop kicks in and cannot be cached
+ * (this happens e.g. for `arr / scalar`) and then we might check a couple
+ * of casts to search the right loop.
+ */
+ if ((unsigned int)(from->type_num) <= NPY_CLONGDOUBLE
+ && (unsigned int)(to_dtype->type_num) <= NPY_CLONGDOUBLE
+ && PyArray_MinCastSafety(casting, NPY_SAFE_CASTING) == casting
+ && _npy_can_cast_safely_table[from->type_num][to_dtype->type_num]) {
+ return 1;
+ }
+
PyObject *meth = PyArray_GetCastingImpl(NPY_DTYPE(from), to_dtype);
if (meth == NULL) {
return -1;
because I thought a good deal of time in the type resolver ended up being spent here. But it does not seem to make a huge difference now, so I am not sure it is actually worth it right now (I have to check more).
OK, that diff is wrong, because the important part is the one where `0` is returned (the path can be adapted for the interesting case by checking `casting == NPY_SAFE_CASTING` and then returning the table value). But I am still not really seeing any worthwhile speedup; maybe I am just testing the wrong thing tonight.

EDIT: Fixed diff (not that it matters much; even for `arr / 0darr` or `arr + 0darr` I don't really see it making a big difference):
if ((unsigned int)(from->type_num) <= NPY_CLONGDOUBLE
&& (unsigned int)(to_dtype->type_num) <= NPY_CLONGDOUBLE) {
npy_bool safe;
safe = _npy_can_cast_safely_table[from->type_num][to_dtype->type_num];
if (casting == NPY_SAFE_CASTING) {
return safe;
}
if (safe && PyArray_MinCastSafety(casting, NPY_SAFE_CASTING) == casting) {
return 1;
}
}
@seberg I found the path that triggers the legacy type resolution, for both `np.array([1., 2.]) / 10` and `np.array([1., 2.]) / np.intp(10)`:

- The division ends up in `ufunc_generic_fastcall`, which calls `convert_ufunc_arguments`.
- `convert_ufunc_arguments` is called with `nin=2`, `nout=1` and `allow_legacy_promotion=1`. The first argument is a float64 array, the second a float64 scalar.
- In `convert_ufunc_arguments` this results in `all_scalar=0, any_scalar=1`.
- Then this piece of code
/*
* TODO: we need to special case scalars here, if the input is a
* Python int, float, or complex, we have to use the "weak"
* DTypes: `PyArray_PyIntAbstractDType`, etc.
* This is to allow e.g. `float32(1.) + 1` to return `float32`.
* The correct array dtype can only be found after promotion for
* such a "weak scalar". We could avoid conversion here, but
* must convert it for use in the legacy promotion.
* There is still a small chance that this logic can instead
* happen inside the Python operators.
*/
}
if (*allow_legacy_promotion && (!all_scalar && any_scalar)) {
*force_legacy_promotion = should_use_min_scalar(nin, out_op, 0, NULL);
}
sets `force_legacy_promotion` to one.
Is this something we can improve?
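For background, `force_legacy_promotion` drives NumPy's value-based ("min scalar") promotion: when an array is mixed with a scalar, the scalar's value rather than its nominal dtype can decide the result type. A sketch of the observable behavior (as of the NumPy 1.2x series discussed here; NEP 50 changes the NumPy-scalar case):

```python
import numpy as np

a = np.array([1., 2.], dtype=np.float32)

# Python int scalar: treated as "weak", so the float32 array wins.
print((a / 10).dtype)                             # float32

# NumPy intp scalar: under legacy value-based promotion the small value
# still lets float32 win (under NEP 50 this becomes float64 instead).
print((a / np.intp(10)).dtype)

# A 1-d integer array goes through regular promotion: float32 + intp -> float64.
print((a / np.array([10], dtype=np.intp)).dtype)  # float64
```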
Yes, but I wrote a 2800 line file on that general problem ;).
More seriously: on the path towards the above TODO, there may be a middle ground that just makes the "obvious" paths fast (this one happens to be obvious).
I.e. once the TODO is half gone, `force_legacy_promotion` could probably be removed, as long as we take some care later (and make the paths where it has to kick in fast enough!). That may fall out naturally, but maybe it would just add complexity. So overall, I wouldn't really aim for speeding things up, except by saying: hopefully that NEP will be implemented soon enough, and then we can make sure things are fast.
Happy to put this in, just to confirm that you think it is still worthwhile?
@seberg To be fair: it does not make a large performance difference any more. But the items in …
OK, I will put this in (after tests pass). But if someone does the opposite PR, I will probably put that in also ;).
`_count_reduce_items` is faster with a plain Python `int` than a NumPy scalar `np.intp`. The elements of `arr.shape` are plain integers. Code using the count (e.g. `np.mean`) also might be faster with `int` instead of `np.intp`. (Although in the long run it might be better to make the division with `np.intp` just as fast as with `int`.)

Microbenchmark
Results in:
Addresses one of the items in #21455
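The microbenchmark and its results did not survive extraction above; a hedged sketch of the kind of measurement involved (timing `np.mean` on a small array, where the per-call overhead of `_count_reduce_items` and the final division dominates) could look like:

```python
import numpy as np
from timeit import timeit

x = np.random.rand(10)

# Small input: the reduction itself is trivial, so the Python-level
# bookkeeping inside np.mean is what this timing actually measures.
t = timeit(lambda: np.mean(x), number=100_000)
print(f"np.mean, 10 elements: {t / 100_000 * 1e6:.2f} us per call")
```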