ENH, SIMD: improve argmax/argmin performance #20846
Conversation
npyv_cleanup() is missing. It's important to include it in case the compiler doesn't generate zeroupper or zeroall, to avoid the AVX-SSE transition penalty.
Cool speedups! If I understand correctly, the strategy is to use 512-bit wide vectors (svx) where possible and "emulate" them with smaller vectors where not possible. This provides the most compact code since we do not need 256- or 128-bit-specific loops. Does the compiler "do the right thing" to make the code run at the same speed as if we wrote 128-bit specific loops?
There are a few spots that coverage claims are not hit.
    def time_argmin(self, dtype):
        np.argmin(self.d)
Maybe overkill? Is there any real difference between argmax and argmin? I guess there might be slight differences since they are different loops.
@mattip, sorry, I don't clearly understand your question, but this implementation works perfectly with all currently supported SIMD widths (128, 256, 512), and there's no kind of scalar emulation. It just requires the minimum length of the array to be SIMD_WIDTH*4, because the reduce operation for the final vector accumulator isn't cheap and would cause performance regressions on small arrays. Also, the reason behind unrolling by 4 vector loads per iteration is that most new CPUs nowadays can execute multiple logical, bitwise, and comparison operations at the same time.
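The strategy described above (4 independent accumulators per iteration, a final cross-lane reduction, and a scalar fallback below SIMD_WIDTH*4 elements) can be sketched in pure Python. This is only an illustrative model, not the actual C universal-intrinsics implementation; `SIMD_WIDTH`, `argmax_unrolled`, and the lane bookkeeping are hypothetical names, and each "lane" here stands in for a whole vector load.

```python
SIMD_WIDTH = 8  # hypothetical: elements per vector register

def argmax_unrolled(a):
    """First-occurrence argmax, modeling 4-way unrolled accumulation."""
    n = len(a)
    if n < SIMD_WIDTH * 4:
        # Scalar fallback: the final cross-lane reduction is not cheap,
        # so it would cause regressions on small arrays.
        best = 0
        for i in range(1, n):
            if a[i] > a[best]:
                best = i
        return best
    # Four independent accumulators mimic 4 unrolled vector loads; a CPU
    # can execute these independent comparisons in parallel.
    best = [0, 1, 2, 3]
    i = 4
    while i + 4 <= n:
        for lane in range(4):
            j = i + lane
            if a[j] > a[best[lane]]:  # strict > keeps the earliest index
                best[lane] = j
        i += 4
    # Final reduction across the 4 accumulators, preserving
    # first-occurrence semantics on ties.
    m = max(a[j] for j in best)
    winner = min(j for j in best if a[j] == m)
    # Scalar tail for the leftover elements.
    while i < n:
        if a[i] > a[winner]:
            winner = i
        i += 1
    return winner
```

The strict `>` comparisons and the min-index tie-break in the reduction keep NumPy's first-occurrence rule intact even though the four accumulators scan interleaved index classes.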
Thanks, unrolling by 4 was the thing that caught my eye, and your explanation makes it clear.
Looking at the benchmark timings, it seems that across architectures the current code is operation-bound (all the dtypes except bool take approximately the same time), whereas the new code makes the ufunc performance more memory-bound (smaller dtypes are faster), and that there is a cost to processing NaNs.
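The memory-bound behavior described above can be observed directly: with the same element count, smaller dtypes stream less data, so they should run faster. A rough, hedged sketch (timings are machine-dependent, so no expected numbers are given; the array size and repeat count are arbitrary choices):

```python
import timeit
import numpy as np

n = 1_000_000
for dtype in (np.int8, np.int32, np.int64, np.float64):
    a = np.ones(n, dtype=dtype)  # same element count, different byte sizes
    t = timeit.timeit(lambda: np.argmax(a), number=20)
    print(f"{np.dtype(dtype).name:>8}: {t / 20 * 1e6:.1f} us")
```

If argmax were still operation-bound, the per-dtype times would be roughly equal; memory-bound behavior shows up as int8 being markedly faster than float64.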
numpy/numpy/core/src/multiarray/arraytypes.c.src Lines 3262 to 3274 in ed84326
- 214±0.5μs 28.0±0.1μs 0.13 bench_reduce.ArgMax.time_argmax(<class 'bool'>)
That's true, but I can give priority to numeric values and gain around 40% speed; however, that will have a pretty bad impact on nanargmax/nanargmin performance. My idea is to invert the FP test and avoid breaking the loop. By the way, the NaN benchmark isn't fair, and it doesn't expose the real improvements. See: numpy/benchmarks/benchmarks/bench_lib.py Lines 72 to 74 in ed84326
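For reference, these are the NaN semantics that constrain the FP loops and explain the trade-off above: np.argmax propagates NaN (it returns the index of the first NaN), while np.nanargmax ignores NaNs, so a fast path that simply prioritizes numeric values would change results.

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0, np.nan, 2.0])

# argmax treats NaN as the maximum: it returns the first NaN's index.
print(np.argmax(a))      # -> 1

# nanargmax is NaN-aware: it returns the index of the largest finite value.
print(np.nanargmax(a))   # -> 2
```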
@mattip, code coverage now hits almost all the paths. I think it's ready for merge.
Thanks @seiko2plus
I forgot, do we add release notes for these performance enhancements? Maybe we could add a single note and mention all the ufuncs we have converted to SIMD.
We have a "performance" category, so we do, but we are very spotty about it. It is also OK to add one note for each, I think; just make sure to use the brief bullet format.
Improves argmax/argmin performance for all integers, f32 and f64 data types on all supported architectures via universal intrinsics.
Related to #20785, #20131
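Before the benchmark runs below, a quick cross-dtype sanity check for the dtypes the PR covers, comparing the SIMD-dispatched argmax/argmin against a pure-Python first-occurrence reference (the seed and array shape are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
dtypes = [np.int8, np.uint8, np.int16, np.uint16, np.int32, np.uint32,
          np.int64, np.uint64, np.float32, np.float64]

for dt in dtypes:
    a = rng.integers(0, 100, size=1000).astype(dt)
    # Python's max()/min() return the first item among ties, matching
    # NumPy's first-occurrence rule for argmax/argmin.
    assert int(np.argmax(a)) == max(range(a.size), key=a.__getitem__)
    assert int(np.argmin(a)) == min(range(a.size), key=a.__getitem__)
print("argmax/argmin agree with the scalar reference for all dtypes")
```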
X86
CPU
OS
Linux ip-172-31-32-40 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:03:59 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux Python 3.8.10 gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Benchmark
AVX512_SKX
AVX2
SSE42
BASELINE(SSE3)
Power little-endian
CPU
OS
Linux e517009a912a 4.19.0-2-powerpc64le #1 SMP Debian 4.19.16-1 (2019-01-17) ppc64le ppc64le ppc64le GNU/Linux gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Benchmark
baseline(VSX2)
python runtests.py -n --bench-compare parent/main "argmax|argmin" -- --sort ratio
AArch64
CPU
OS
Linux ip-172-31-44-172 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:01:34 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Benchmark
baseline(ASIMD)
python runtests.py --bench-compare parent/main "argmax|argmin" -- --sort ratio
Binary size (stripped)