-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
ENH,SIMD: Vectorize modulo/divide using the universal intrinsics (VSX4) #21124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the new optimizations should be moved into a separated dispatch-able source that only holds targets "baseline vsx4" e.g. loops_modulo.dispatch.c.src to avoid the unnecessary increase of binary size on x86 for the un-affected targets (sse41 avx2 avx512f avx512_skx) within "loops_arithmetic.dispatch.c.src ".
Hi @seiko2plus, thank you for reviewing my code. Below are two points I would like to discuss with you:
I would like to know if even considering the points above I should still move the code to another dispatchable source file. If you still think I should apply the suggested changes, I will do it. |
Hi, I would like to request a review of my code (cc: @mattip , @seiko2plus ) For this PR, specifically, I would like to know if what I said above makes sense, if not, let me know so I will apply the changes that were asked. The errors in the CI seem to be something related to timeout. The PRs Thanks! |
Hi, I analyzed the size of the binaries generated for x86, and as stated by @seiko2plus, the contributions of this PR are indeed generating some increase in the binaries since the code moved from
I will move the operations |
cb74ca6
to
9273233
Compare
This commit optimizes the operations below: - fmod (signed/unsigned integers) - remainder (signed/unsigned integers) - divmod (signed/unsigned integers) - floor_divide (signed integers) using the VSX4/Power10 integer vector division/modulo instructions. See the improvements below (maximum speedup): - numpy.fmod - arr OP arr: signed (1.17x), unsigned (1.13x) - arr OP scalar: signed (1.34x), unsigned (1.29x) - numpy.remainder - arr OP arr: signed (4.19x), unsigned (1.17x) - arr OP scalar: signed (4.87x), unsigned (1.29x) - numpy.divmod - arr OP arr: signed (4.73x), unsigned (1.23x) - arr OP scalar: signed (5.05x), unsigned (1.31x) - numpy.floor_divide - arr OP arr: signed (4.44x) The times above were collected using the benchmark tool available in NumPy.
9273233
to
a14d047
Compare
Hi @seiko2plus, I modified the code as you asked. With the new changes, the binaries generated for the other architectures now have the same size with/without this PR. See below the same table I shared in my previous comment updated:
Thank you! 👍 |
Thanks @rafaelcfsousa |
Thank you @rafaelcfsousa, @mattip. sorry for delayed response |
ENH,SIMD: Vectorize modulo/divide using the universal intrinsics (VSX4)
This PR optimizes the NumPy operations: (1) divmod, (2) floor_divide, (3) fmod, and (4) remainder for VSX4/Power10.
In summary, the optimization consists in vectorizing the operations listed above to use the new integer vector modulo/division instructions that are available in the ISA 3.1 (Power10/VSX4).
You will notice that I moved the operations listed above from
loops.c.src
toloops_arithmetic.dispatch.c.src
(except floor_divide).See the reasons for this below:
(1) reuse the code of a recent PR (#20976)
(2) use the CPU features detection (VSX4)
(3) use the universal intrinsics
See below results from a benchmarking I ran on a Power10 machine:
numpy.fmod
numpy.remainder
numpy.divmod
numpy.floor_divide
P.S.: I left 3 lint errors in my code that are the function signatures.