
ENH, SIMD: improve argmax/argmin performance #20846


Merged: 2 commits merged into numpy:main from the simd_argmaxmin branch on Feb 13, 2022

Conversation

@seiko2plus (Member) commented on Jan 18, 2022

This improves argmax/argmin performance for all integer, f32 and f64 data types on all supported architectures via universal intrinsics.

Related to #20785 and #20131.

X86

CPU
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:                        4
CPU MHz:                         3410.808
BogoMIPS:                        5999.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        2 MiB
L3 cache:                        24.8 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
OS
Linux ip-172-31-32-40 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:03:59 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Python 3.8.10
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

AVX512_SKX
unset NPY_DISABLE_CPU_FEATURES
python runtests.py -n --bench-compare parent/main "argmax|argmin" -- --sort ratio
       before           after         ratio
     [4b985e2b]       [c99921ad]
     <simd_argmaxmin~1>       <simd_argmaxmin>
-        1.66±0ms         1.54±0ms     0.93  bench_lib.Nan.time_nanargmax(200000, 50.0)
-         926±3μs          850±3μs     0.92  bench_lib.Nan.time_nanargmax(200000, 90.0)
-        1.72±0ms         1.55±0ms     0.90  bench_lib.Nan.time_nanargmin(200000, 50.0)
-         380±3μs        325±0.4μs     0.86  bench_lib.Nan.time_nanargmax(200000, 0.1)
-         993±4μs          849±3μs     0.85  bench_lib.Nan.time_nanargmin(200000, 90.0)
-         378±3μs        323±0.2μs     0.85  bench_lib.Nan.time_nanargmax(200000, 0)
-         482±3μs        406±0.8μs     0.84  bench_lib.Nan.time_nanargmax(200000, 2.0)
-         548±3μs        407±0.8μs     0.74  bench_lib.Nan.time_nanargmin(200000, 2.0)
-         444±3μs        326±0.9μs     0.73  bench_lib.Nan.time_nanargmin(200000, 0.1)
-         441±3μs        322±0.9μs     0.73  bench_lib.Nan.time_nanargmin(200000, 0)
-     6.49±0.08μs      4.45±0.01μs     0.68  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
-       182±0.3μs       59.8±0.4μs     0.33  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
-       182±0.1μs       59.6±0.6μs     0.33  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
-      182±0.07μs       59.2±0.2μs     0.33  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-       124±0.4μs        26.8±20μs     0.22  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
-       122±0.4μs       18.9±0.2μs     0.16  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
-       180±0.2μs       18.8±0.1μs     0.10  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-       180±0.2μs       18.6±0.4μs     0.10  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-         202±4μs        19.1±20μs     0.09  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-         204±4μs        19.1±20μs     0.09  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-       120±0.1μs      9.56±0.04μs     0.08  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
-       120±0.2μs      9.54±0.02μs     0.08  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-         218±3μs         13.7±2μs     0.06  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-         193±5μs         11.0±4μs     0.06  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
-         212±2μs         11.0±3μs     0.05  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
-      179±0.07μs      6.15±0.02μs     0.03  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
-      179±0.05μs      6.12±0.02μs     0.03  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-       218±0.7μs         7.07±1μs     0.03  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
-         219±1μs         7.05±1μs     0.03  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
-         190±2μs      6.01±0.02μs     0.03  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
-         191±3μs      6.02±0.01μs     0.03  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
AVX2
export NPY_DISABLE_CPU_FEATURES="AVX512F AVX512_SKX"
python runtests.py -n --bench-compare parent/main "argmax|argmin" -- --sort ratio
       before           after         ratio
     [4b985e2b]       [c99921ad]
     <simd_argmaxmin~1>       <simd_argmaxmin>
-         964±3μs         895±10μs     0.93  bench_lib.Nan.time_nanargmin(200000, 90.0)
-        1.65±0ms         1.53±0ms     0.93  bench_lib.Nan.time_nanargmin(200000, 50.0)
-         365±1μs        332±0.9μs     0.91  bench_lib.Nan.time_nanargmax(200000, 0)
-       370±0.5μs          336±1μs     0.91  bench_lib.Nan.time_nanargmax(200000, 0.1)
-     6.49±0.06μs      5.80±0.04μs     0.89  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
-         470±2μs          417±1μs     0.89  bench_lib.Nan.time_nanargmax(200000, 2.0)
-         532±2μs        418±0.5μs     0.79  bench_lib.Nan.time_nanargmin(200000, 2.0)
-       432±0.8μs        335±0.6μs     0.78  bench_lib.Nan.time_nanargmin(200000, 0.1)
-         428±2μs          332±1μs     0.77  bench_lib.Nan.time_nanargmin(200000, 0)
-       124±0.1μs         61.4±7μs     0.50  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
-      182±0.09μs       75.1±0.3μs     0.41  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
-      182±0.08μs       71.4±0.3μs     0.39  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
-      182±0.09μs       66.2±0.3μs     0.36  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-         202±4μs         59.5±6μs     0.29  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-       122±0.3μs       35.3±0.3μs     0.29  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
-         203±3μs         49.7±8μs     0.25  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-      180±0.09μs       32.5±0.3μs     0.18  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-       180±0.1μs       28.8±0.2μs     0.16  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-         198±4μs         30.0±2μs     0.15  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
-         217±4μs         32.0±2μs     0.15  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-       120±0.2μs      17.3±0.05μs     0.14  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-       120±0.1μs      15.2±0.02μs     0.13  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
-         213±5μs         25.2±1μs     0.12  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
-       218±0.7μs       16.4±0.2μs     0.08  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
-         217±2μs      15.3±0.03μs     0.07  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
-      179±0.05μs      9.81±0.04μs     0.05  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-         193±4μs      9.80±0.04μs     0.05  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
-      179±0.07μs      8.47±0.01μs     0.05  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
-         193±3μs      8.64±0.03μs     0.04  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
SSE42
export NPY_DISABLE_CPU_FEATURES="AVX2 AVX512F AVX512_SKX"
python runtests.py -n --bench-compare parent/main "argmax|argmin" -- --sort ratio
       before           after         ratio
     [4b985e2b]       [c99921ad]
     <simd_argmaxmin~1>       <simd_argmaxmin>
-        1.60±0ms         1.52±0ms     0.95  bench_lib.Nan.time_nanargmax(200000, 50.0)
-       470±0.7μs          442±4μs     0.94  bench_lib.Nan.time_nanargmax(200000, 2.0)
-     6.55±0.08μs      6.05±0.02μs     0.92  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
-        1.66±0ms         1.51±0ms     0.91  bench_lib.Nan.time_nanargmin(200000, 50.0)
-         964±2μs          865±2μs     0.90  bench_lib.Nan.time_nanargmin(200000, 90.0)
-       430±0.9μs          359±3μs     0.84  bench_lib.Nan.time_nanargmin(200000, 0)
-         433±1μs          360±3μs     0.83  bench_lib.Nan.time_nanargmin(200000, 0.1)
-         534±1μs          441±3μs     0.83  bench_lib.Nan.time_nanargmin(200000, 2.0)
-       124±0.1μs         95.6±3μs     0.77  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
-       182±0.1μs        102±0.9μs     0.56  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
-       182±0.1μs        101±0.4μs     0.55  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
-       182±0.2μs       81.6±0.2μs     0.45  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-         122±4μs       54.3±0.4μs     0.44  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
-        213±10μs         89.3±6μs     0.42  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-         209±8μs         72.9±6μs     0.35  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-       180±0.2μs       48.3±0.2μs     0.27  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-         218±3μs         48.4±2μs     0.22  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-      180±0.06μs       39.3±0.4μs     0.22  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-        209±10μs         45.2±1μs     0.22  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
-         122±2μs      24.1±0.03μs     0.20  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-         219±4μs         38.7±1μs     0.18  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
-         120±2μs      19.4±0.01μs     0.16  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
-       219±0.5μs      24.2±0.05μs     0.11  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
-       219±0.9μs      20.5±0.04μs     0.09  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
-      179±0.05μs      13.2±0.01μs     0.07  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-        207±10μs      13.3±0.03μs     0.06  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
-       179±0.1μs      11.0±0.02μs     0.06  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
-        209±10μs      11.1±0.02μs     0.05  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
BASELINE(SSE3)
export NPY_DISABLE_CPU_FEATURES="SSE42 AVX2 AVX512F AVX512_SKX"
python runtests.py -n --bench-compare parent/main "argmax|argmin" -- --sort ratio
       before           after         ratio
     [4b985e2b]       [c99921ad]
     <simd_argmaxmin~1>       <simd_argmaxmin>
-        1.66±0ms      1.54±0.01ms     0.93  bench_lib.Nan.time_nanargmin(200000, 50.0)
-         198±5μs          183±1μs     0.93  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-         967±3μs          883±2μs     0.91  bench_lib.Nan.time_nanargmin(200000, 90.0)
-         433±2μs          376±4μs     0.87  bench_lib.Nan.time_nanargmin(200000, 0)
-         438±2μs          377±3μs     0.86  bench_lib.Nan.time_nanargmin(200000, 0.1)
-       124±0.3μs          106±6μs     0.86  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
-         535±3μs          457±4μs     0.85  bench_lib.Nan.time_nanargmin(200000, 2.0)
-       182±0.2μs        153±0.6μs     0.84  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-         202±3μs          150±1μs     0.75  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-       182±0.2μs          115±2μs     0.63  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
-       122±0.4μs       64.2±0.9μs     0.53  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
-       180±0.1μs       58.4±0.6μs     0.32  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-       180±0.2μs       50.4±0.8μs     0.28  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-         197±7μs         55.0±2μs     0.28  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
-         218±3μs         58.4±3μs     0.27  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-       121±0.2μs       29.6±0.5μs     0.25  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-         213±2μs         47.0±1μs     0.22  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
-       120±0.3μs       24.0±0.4μs     0.20  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
-       219±0.4μs       29.3±0.2μs     0.13  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
-       219±0.9μs       23.7±0.2μs     0.11  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
-       179±0.2μs       17.7±0.1μs     0.10  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-         192±1μs      17.7±0.08μs     0.09  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
-       179±0.1μs      15.4±0.06μs     0.09  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
-         188±2μs      14.8±0.08μs     0.08  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)

Power little-endian

CPU
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor   : 7
cpu     : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)

timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix

OS
Linux e517009a912a 4.19.0-2-powerpc64le #1 SMP Debian 4.19.16-1 (2019-01-17) ppc64le ppc64le ppc64le GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

baseline(VSX2)
python runtests.py -n --bench-compare parent/main "argmax|argmin" -- --sort ratio
       before           after         ratio
     [4b985e2b]       [c99921ad]
     <simd_argmaxmin~1>       <simd_argmaxmin>
-     1.90±0.01ms         1.78±0ms     0.93  bench_lib.Nan.time_nanargmin(200000, 90.0)
-       294±0.6μs        275±0.9μs     0.93  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
-     1.10±0.01ms          997±7μs     0.90  bench_lib.Nan.time_nanargmin(200000, 2.0)
-         896±6μs          778±2μs     0.87  bench_lib.Nan.time_nanargmin(200000, 0.1)
-        885±10μs          767±2μs     0.87  bench_lib.Nan.time_nanargmin(200000, 0)
-         373±2μs        275±0.3μs     0.74  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
-       219±0.2μs        156±0.5μs     0.71  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-         219±2μs          154±1μs     0.70  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
-      218±0.08μs        152±0.3μs     0.70  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-       218±0.2μs       152±0.08μs     0.70  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
-       218±0.3μs          148±2μs     0.68  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-       219±0.8μs          149±2μs     0.68  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-      237±0.09μs        156±0.4μs     0.66  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-         235±1μs          155±1μs     0.66  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
-       329±0.2μs        178±0.3μs     0.54  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
-         327±1μs          176±1μs     0.54  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-      216±0.07μs       50.3±0.1μs     0.23  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-         215±1μs         50.0±1μs     0.23  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
-       233±0.5μs      50.6±0.06μs     0.22  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
-       232±0.9μs       48.9±0.9μs     0.21  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
-      216±0.06μs       28.7±0.1μs     0.13  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-       216±0.6μs      28.5±0.04μs     0.13  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
-       214±0.5μs       28.0±0.1μs     0.13  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
-       233±0.1μs       28.6±0.1μs     0.12  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
-       233±0.2μs       28.6±0.1μs     0.12  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)

AArch64

CPU
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              1
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
BogoMIPS:                        243.75
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        2 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0,1
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
OS
Linux ip-172-31-44-172 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:01:34 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

baseline(ASIMD)
python runtests.py --bench-compare parent/main "argmax|argmin" -- --sort ratio
       before           after         ratio
     [4b985e2b]       [c99921ad]
     <simd_argmaxmin~1>       <simd_argmaxmin>
-     20.5±0.02μs      18.2±0.01μs     0.89  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
-         169±2μs         88.9±4μs     0.53  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-         169±2μs         88.9±3μs     0.53  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-       165±0.1μs       83.1±0.6μs     0.50  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-       165±0.3μs       82.8±0.5μs     0.50  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
-         246±1μs       81.1±0.9μs     0.33  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-       244±0.2μs       79.1±0.2μs     0.32  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
-       166±0.8μs         45.1±2μs     0.27  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
-       166±0.9μs         44.8±1μs     0.27  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
-       164±0.2μs       43.1±0.3μs     0.26  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-      164±0.06μs       43.0±0.2μs     0.26  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-       170±0.4μs       24.9±0.5μs     0.15  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
-       170±0.3μs       24.9±0.5μs     0.15  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
-       169±0.1μs       24.6±0.2μs     0.15  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
-       169±0.2μs      24.5±0.07μs     0.15  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-       163±0.1μs      13.9±0.04μs     0.08  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
-       163±0.1μs      13.9±0.09μs     0.08  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
-      163±0.09μs      13.8±0.01μs     0.08  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-       163±0.1μs      13.8±0.03μs     0.08  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)

Binary size (stripped)

LIB                                                     Before (KB)   After (KB)   Diff (KB)
_multiarray_umath.cpython-38-x86_64-linux-gnu.so               4428         4532        104
_multiarray_umath.cpython-38-powerpc64le-linux-gnu.so          4264         4292         29
_multiarray_umath.cpython-38-aarch64-linux-gnu.so              3424         3444         20

@seiko2plus added the "component: SIMD" label (Issues in SIMD (fast instruction sets) code or machinery) on Jan 18, 2022
@seiko2plus marked this pull request as ready for review on January 18, 2022 03:37
@seiko2plus (Member, Author) left a comment

npyv_cleanup() is missing; it is important to include it in case the compiler does not generate zeroupper or zeroall on its own, so that we avoid the AVX-SSE transition penalty.
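(For illustration only: a minimal sketch of where npyv_cleanup() sits in a dispatched kernel. The function name and the elided body below are placeholders, not the PR's actual code.)

#if NPY_SIMD
static void
simd_argmax_f32_sketch(const npy_float *ip, npy_intp len, npy_intp *max_ind)
{
    /* ... vectorized body using npyv_* universal intrinsics ... */
    npyv_cleanup();  /* emits zeroupper/zeroall on targets that need it, so
                      * later SSE code does not pay the AVX-SSE transition
                      * penalty if the compiler omits the instruction */
}
#endif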

@mattip (Member) left a comment

Cool speedups! If I understand correctly, the strategy is to use 512-bit-wide vectors (svx) where possible and "emulate" them with smaller vectors where not possible. This gives the most compact code, since we do not need 256- or 128-bit-specific loops. Does the compiler "do the right thing" so that the code runs at the same speed as if we wrote 128-bit-specific loops?

Coverage claims there are a few spots that are not hit.


def time_argmin(self, dtype):
    np.argmin(self.d)

@mattip (Member) commented:

Maybe overkill? Is there any real difference between argmax and argmin? I guess there might be slight differences, since they are different loops.

@seiko2plus (Member, Author) commented

@mattip, sorry, I don't clearly understand your question, but this implementation works with all currently supported SIMD widths (128, 256, 512) and there is no kind of scalar emulation. It just requires the array length to be at least SIMD_WIDTH*4, because the reduction of the final vector accumulator isn't cheap and would cause performance regressions on small arrays. Also, the reason for unrolling by 4 vector loads per iteration is that most modern CPUs can execute multiple logical, bitwise, and comparison operations at the same time.
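(A scalar sketch of that unroll-by-4 idea, for illustration only: this is not the PR's universal-intrinsics kernel, and the function name, integer dtype, and tie handling are assumptions made for the example.)

#include <stddef.h>
#include <stdint.h>

/* Keep four independent running maxima so the comparisons in a block are not
 * serialized on a single dependency chain, then merge them once at the end.
 * Short arrays fall straight through to the scalar tail, mirroring the
 * "minimum length" threshold described above. Caller guarantees len > 0. */
static size_t
argmax_i64_sketch(const int64_t *ip, size_t len)
{
    size_t i = 0, best = 0;
    if (len >= 8) {
        size_t b0 = 0, b1 = 1, b2 = 2, b3 = 3;
        for (i = 4; i + 4 <= len; i += 4) {
            if (ip[i + 0] > ip[b0]) b0 = i + 0;
            if (ip[i + 1] > ip[b1]) b1 = i + 1;
            if (ip[i + 2] > ip[b2]) b2 = i + 2;
            if (ip[i + 3] > ip[b3]) b3 = i + 3;
        }
        /* the comparatively expensive merge/reduction step; prefer the
         * lowest index on ties to keep argmax's first-occurrence rule */
        size_t cand[4] = {b0, b1, b2, b3};
        for (int k = 0; k < 4; ++k) {
            if (ip[cand[k]] > ip[best] ||
                (ip[cand[k]] == ip[best] && cand[k] < best)) {
                best = cand[k];
            }
        }
    }
    for (; i < len; ++i) {  /* scalar tail (and the small-array path) */
        if (ip[i] > ip[best]) best = i;
    }
    return best;
}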

@mattip (Member) commented on Jan 19, 2022

Thanks, unrolling by 4 was the thing that caught my eye, and your explanation makes it clear.

@mattip (Member) commented on Jan 19, 2022

Looking at the benchmark timings, it seems that across architectures the current code is operation-bound (all the dtypes except bool take approximately the same time), while the new code makes the ufunc performance more memory-bound (smaller dtypes are faster), and that there is a cost to processing NaNs.

@seiko2plus (Member, Author) commented on Jan 20, 2022

@mattip,

> except bool take approximately the same time

This is for two reasons:

  1. BOOL_argmin() isn't included in this PR because memchr is already well optimized by GCC and Clang; the first False is just the first zero byte, which memchr finds directly:

static int
BOOL_argmin(npy_bool *ip, npy_intp n, npy_intp *min_ind,
            PyArrayObject *NPY_UNUSED(aip))
{
    npy_bool *p = memchr(ip, 0, n * sizeof(*ip));
    if (p == NULL) {
        *min_ind = 0;
        return 0;
    }
    *min_ind = p - ip;
    return 0;
}

  2. BOOL_argmax() was already vectorized with raw SIMD for both x86 and AArch64. Note that on VSX it shows a pretty good improvement:
-       214±0.5μs       28.0±0.1μs     0.13  bench_reduce.ArgMax.time_argmax(<class 'bool'>)

> that there is a cost to processing NaNs

That's true, but I could give priority to numeric values and gain around 40% speed; however, that would have a pretty bad impact on nanargmax/nanargmin performance. My idea is to invert the FP test and avoid breaking the loop. By the way, the NaN benchmark isn't fair enough, and it doesn't expose the real improvements. See:

base_array = np.random.uniform(size=array_size)
base_array[base_array < percent_nans / 100.] = np.nan
self.arr = base_array

@seiko2plus force-pushed the simd_argmaxmin branch 2 times, most recently from 7f32216 to 88c13b9 on February 13, 2022 12:27
  for all integers, f32 and f64 data types on all supported architectures via universal intrinsics.
  to avoid the AVX-SSE transition penalty.
@seiko2plus (Member, Author) commented

@mattip, code coverage now hits almost all of the paths. I think it's ready for merge.

@mattip merged commit 7fe5d7f into numpy:main on Feb 13, 2022
@mattip (Member) commented on Feb 13, 2022

Thanks @seiko2plus

@mattip (Member) commented on Feb 13, 2022

I forgot, do we add release notes for these performance enhancements? Maybe we could add a single note and mention all the ufuncs we have converted to SIMD.

@seberg (Member) commented on Feb 13, 2022

We have a "performance" category, so we do, but we are very, very spotty about it. It is also OK to add multiple notes, one for each, I think; just make sure to use the brief bullet format: * Improved performance of ....
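For instance, a hypothetical entry in that format (illustrative only, not the note actually added) might read:

* Improved performance of np.argmax and np.argmin for all integer and float dtypes via universal SIMD intrinsics.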

Labels
  01 - Enhancement
  56 - Needs Release Note (needs an entry in doc/release/upcoming_changes)
  component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)
3 participants