ENH: Improve the performance of einsum by using universal simd #17049
Conversation
@seiko2plus Can you add Power8/9 benchmark results here? Feel free to add more benchmark test cases.
What compiler version are you using? I'm a little worried that we're manually optimizing things that a clever compiler can do for us, but we're benchmarking on a compiler that doesn't. At a glance, GCC 10 emits very similar code to this manual vectorization for a simple for loop. I think all SIMD benchmarks at a minimum need to state the compiler version for the "before", and should probably include a column with the latest-and-greatest compiler for comparison.
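For reference, here is a minimal sketch (illustrative only, not code from this PR) of the kind of simple loop this comment is talking about. GCC 10 at `-O3` will auto-vectorize the elementwise multiply; vectorizing the float sum reduction additionally requires `-ffast-math` (or `-fassociative-math`), since it reorders floating-point additions:

```c
#include <stddef.h>

/* Elementwise multiply: auto-vectorized by recent GCC at -O3. */
void mul_f32(float *out, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] * b[i];
    }
}

/* Sum-of-products reduction: the compiler only vectorizes this if it is
 * allowed to reassociate float additions (-ffast-math / -fassociative-math). */
float dot_f32(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Comparing the hand-written SIMD against what an up-to-date compiler emits for loops like these is what a "latest compiler" column in the benchmarks would capture.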
On the x86 platform, I'm using the MSVC compiler (version 14.26.28801), with the following args:
It looks like MSVC 16.3 adds auto-vectorized AVX-512 support, so it might be worth trying that for comparison.
I think you mean Visual Studio version 16.3. My Visual Studio version is 16.4, so AVX-512 auto-vectorizer support is available. Here is the benchmark result with
LGTM, a few last comments.
Are the benchmark results at the top of the PR current for the last changeset? I think 2e713b0 is the last commit. Does the size of _multiarray_umath.so change?
rev[1/4], improve the reduce sum on x86 (SSE, AVX2)
This patch doesn't cause any performance changes; it just aims to simplify the review process for numpy#17049, per numpy#17049 (comment).
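For context, a minimal sketch of what a reduce sum looks like when written with NumPy's universal intrinsics (the npyv_* API this PR series builds on). This is an illustrative example, not the actual diff from the commit; it assumes "simd/simd.h" is on the include path (as it is inside numpy/core/src) and that npyv_load_f32, npyv_add_f32 and npyv_sum_f32 behave as their names suggest:

```c
#include "simd/simd.h"  /* NumPy's universal SIMD header (numpy/core/src/common) */

/* Sum a contiguous float32 buffer.  The same source lowers to SSE, AVX2,
 * AVX-512 or NEON depending on how NumPy was built; the scalar loop below
 * handles the tail and serves as the fallback when NPY_SIMD is 0. */
static float
sum_f32(const float *src, npy_intp len)
{
    npy_intp i = 0;
    float r = 0.0f;
#if NPY_SIMD
    const int vstep = npyv_nlanes_f32;              /* lanes per vector */
    npyv_f32 acc = npyv_zero_f32();
    for (; i + vstep <= len; i += vstep) {
        acc = npyv_add_f32(acc, npyv_load_f32(src + i));
    }
    r = npyv_sum_f32(acc);                          /* horizontal reduction */
#endif
    for (; i < len; ++i) {
        r += src[i];
    }
    return r;
}
```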
This should be good to rebase now. You should be able to delete
At first glance at the benchmark numbers, it seems to me that on the x86 platform AVX2 and AVX-512 aren't providing any speedup relative to SSE. Is it worth adding extra code to the library for no benefit?
@mattip some small arrays got a ratio of 1.05 after running multiple times, which I think is caused by normal performance jitter.
@eric-wieser Any other suggestions? Thanks.
Co-authored-by: Eric Wieser <wieser.eric@gmail.com>
@mattip @eric-wieser The last commits are comment and typo fixes, with no impact on performance.
I can split this PR into the following 9 PRs if it's too large to merge.
PRs to do (1) and (2) have been merged.
What's the status of this PR? Did all the components end up as separate PRs? If so, can we link them here and then close this?
@eric-wieser The final part is coming soon.
Mission Accomplished, closing now. |
Rebased from #16641 in order to start a cleanup review. The optimization resulted in a performance improvement of 10%~77%. Here are the benchmark results:
x86 machine (AVX512F enabled): [benchmark table]
ARM machine (NEON enabled): [benchmark table]
Co-Author: @seiko2plus