-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: Add SIMD operation copysign #19770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c466922
to
5bc0d7a
Compare
ae737db
to
bde468f
Compare
Hi @rgommers I have added |
Hi @rgommers, I have appended the result of benchmark ran in CI. Please take a look. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @howjmay. The speedup is only ~10% so that seems a bit marginal, but on the other the code complexity here isn't very high. So it may be a decent tradeoff - it's not quite clear to me. Any opinions @seiko2plus, @Qiyu8, @mattip?
benchmarks/benchmarks/bench_ufunc.py
Outdated
@@ -81,6 +81,8 @@ def setup(self): | |||
self.i = np.ones(150000, dtype=np.int32) | |||
self.f = np.zeros(150000, dtype=np.float32) | |||
self.d = np.zeros(75000, dtype=np.float64) | |||
|
|||
self.tf = np.ones(150000, dtype=np.int32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this the same as self.i
, so you can reuse that instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to point it out, but you do not want to use an integer array in the test. This will trigger casting. You need to use a homogeneous signature (i.e. both inputs are float32 or float64).
The speedup may still not be huge (this is probably memory-bound anyway), but adding a cast might hide most of it even if it is huge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rgommers, @seberg do you guys know how can I run two benchmark for comparing the original implementation and the SIMD implementation? I have changed the benchmark test for using float64, but the elapsed time increased. I think it is quite weird, and the 10% speedup is weird for my side too. I thought it will accelerate much more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For running the benchmark, you could reorder your commits so the benchmark one comes first, and put a new branch name pointing to that commit. Then you can compare that new branch with the simd-copysign
one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reordering commits should not even be necessary (although maybe nice) if you use --bench-compare
from asv through runtests: https://numpy.org/doc/stable/benchmarking.html
Not sure if the docs needs updating, I feel I always have to run it from the benchmarks folder with ../runtests.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a missing understanding here, this patch only adds universal intrinsics for copysign but without actually using them in the inner loop of ufunc so there's no need for a benchmark.
@howjmay, First of all, I would like to thank you for your interest in improving the performance of NumPy, we really appreciate your efforts but again, there's no need to add new intrinsics for |
Thank you for informing! |
@howjmay Did you mean to reopen this PR? If not, please close it once again when you get a chance. |
@seiko2plus Should this be closes (again)? |
Closing. @howjmay If you wish to pursue this, please make another PR. |
The benchmark run in the CI gave the following response.