ENH: help compilers to auto-vectorize reduction operators #21001
Conversation
Cool, and probably long overdue especially for the casting! @seiko2plus may have comments about using universal intrinsics, but it may be best to move forward here first and then add that later (at least for the casting part).
If you are up for it, I think it would make sense to split this into two PRs, one for casting and one for the integer reductions, that should make it a bit easier to discuss each change.
(It isn't a big PR though.)
I did not have a super close look yet, but I made some comments about casting and alignment. (Basically, I wonder if this is worthwhile for the unaligned case; and possibly incorrect for that one dead branch).
One thing you can also already do if you like: add a release note for performance; see doc/release/upcoming_changes/README.rst.
Hmm, had not noticed the test failure. Might be another reason to split up the two, since I am not sure whether the failures come from the reductions or the cast change. It looks like it should be due to the cast change, but it is not clear to me why. (Note that the "benchmark" one is just spurious.) EDIT: Ah, the test seems to be
I saw that too, but I found it very weird since I did not really change the actual computation, just some kind of compiler "hints". Your point about alignment makes me think that GCC could assume the input is aligned when using AVX2 and simply generate wrong code for this benchmark. That being said, it would be very surprising since AFAIK GCC should not make such an assumption unless explicitly stated. Otherwise, I guess it could mean the current code has undefined behavior and using AVX2 makes it visible due to different optimizations (which is scary).
It makes sense! Casting a floating-point value to an integer type that cannot represent its integral part is indeed undefined behavior. See this StackOverflow post.
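For context, here is a minimal generic illustration of that undefined behaviour (this is not code from the PR; the function name is purely hypothetical):

```c
#include <stdint.h>

/* Converting a floating-point value whose integral part cannot be
 * represented by the target integer type is undefined behaviour in C
 * (C11 6.3.1.4). Scalar code may happen to produce one result while
 * AVX2 conversion instructions may produce another (e.g. 0x80000000
 * for out-of-range doubles), so enabling vectorization can change the
 * observed output without the compiler being at fault. */
static int32_t cast_double_to_i32(double x)
{
    return (int32_t)x;  /* fine for values in range; UB for e.g. 1e30 or NaN */
}
```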
I am fine with this, but how am I supposed to proceed? Should I just revert the casting change in this PR and open a new PR with this one mentioned in it, so this one can be merged first?
Sounds good, i.e. remove the casting part of the PR and create a new PR for that (or vice versa, I guess). About the undefined behaviour, I have to think. I am not surprised it behaves differently under UB, but if it was effectively always well defined in practice, we have to decide that this is a weird corner case that should not matter, or see if there is a trick to preserve the behaviour.
Force-pushed from 8b6bea3 to c785851, then from c785851 to 3ea71da.
I added a release note and split the PR into two parts. This one still focuses on faster reduction operators. The other part, about casting, is #21123.
A bit too burnt out to merge right now, but looks good to me. Unless @seiko2plus has any concerns?
It could be nice to add additional benchmarks, but we do have a benchmark for the addition path, so that seems OK since it is all in one chunk.
@@ -0,0 +1,3 @@
Faster reduction operators
--------------------------
Reduction operations like `numpy.sum`, `numpy.prod`, `numpy.add.reduce`, `numpy.logical_and.reduce` on contiguous integer-based arrays are now much faster.
Would be nice to break the long line, but doesn't really matter.
Fixed.
Co-authored-by: Matti Picus <matti.picus@gmail.com>
Let's put this in, looks like a nice speed improvement! If the SIMD crowd changes this, having the contiguous special case for comparison is probably good also. Thanks @zephyr111. (One day, I hope we can move these into the
Hi-five on merging your first pull request to NumPy, @zephyr111! We hope you stick around! Your choices aren't limited to programming – you can review pull requests, help us stay on top of new and old issues, develop educational material, work on our website, add or improve graphic design, create marketing materials, translate website content, write grant proposals, and help with other fundraising initiatives. For more info, check out: https://numpy.org/contribute
Hello everyone,
This PR is meant to significantly improve the performance of reduction functions (like `numpy.sum`, for example): from 2.6 to 20 times faster. It also improves the performance of the casting function used by many NumPy functions. These changes are in response to this detailed Stack Overflow analysis showing that the current code of `numpy.sum` does not benefit from SIMD instructions (with both integers and floating-point numbers).

The PR changes the macros so the compiler can generate faster code for reductions when the array is contiguous, in a way similar to the current code for basic operators. It also generates an AVX2 version (if supported) of the conversion function called when the `dtype` of the reduction result does not match that of the array items.

Please find below the performance results of `numpy.sum` with an array of 1 MiB on my Linux machine with an Intel i5-9600KF processor. The tests have been made for all the available integer types, with and without a custom `dtype` argument set to the array's `dtype` (which is either `int64` or `uint64` by default on my machine).

As you can see, the execution time improves significantly (by a factor of up to ~20x). This is especially the case when a `dtype` argument is provided, since a (slow, sub-optimal) conversion function is applied otherwise (as described in the detailed analysis linked above).

Similar improvements apply to functions like `np.prod`, `np.add.reduce`, `np.subtract.reduce`, `np.logical_and.reduce`, etc.

It turns out that improving the casting function also results in substantially faster execution of many functions like `np.cumsum`, `np.cumprod`, `np.all` and `np.any`. Even functions like `np.average`, `np.mean`, or basic multiplications/additions/subtractions by a constant of a different type that requires the array to be cast are a bit faster with an AVX2 casting function.
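As a rough illustration of the reduction part, here is a minimal sketch of the general technique (not the actual NumPy loop macros; the function name and signature are hypothetical): for the contiguous case, the reduction is written as a plain unit-stride loop accumulating into a local variable, which compilers like GCC and Clang can auto-vectorize.

```c
#include <stddef.h>
#include <stdint.h>

/* Contiguous special case: accumulate into a local variable over a
 * unit-stride loop. Since nothing is stored back to memory inside the
 * loop (so no possible aliasing with the output), the compiler is free
 * to turn the additions into SIMD instructions. */
static int64_t
sum_contig_int64(const int64_t *src, size_t n, int64_t initial)
{
    int64_t acc = initial;
    for (size_t i = 0; i < n; i++) {
        acc += src[i];
    }
    return acc;
}
```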