-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH, SIMD: Dispatch for unsigned floor division #18075
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There's no need add any new intrinsics for memory operations, please could you share your SIMD kernel code? EDIT: sorry I saw it, my bad. |
79fada9
to
607fd92
Compare
Thanks @seiko2plus for all the info, I have addressed a few of them. Will do the rest in a while. |
@seiko2plus , any idea why Agner did not go for 64 bit? Should we also just do normal built-in division for 64? |
I guess not worth it. unroll, specialize a loop under opt level 3 would be enough. |
I have implemented the SSE versions, but I notice UT failing locally for float. How float? >>> import numpy as np
>>> np.__version__
'0.3.0+24365.g2a4437284'
>>> fone = np.array(1.0, dtype=np.float32)
>>> fzer = np.array(0.0, dtype=np.float32)
>>> np.floor_divide(fone, fzer)
<stdin>:1: RuntimeWarning: divide by zero encountered in floor_divide
inf
>>> >>> import numpy as np
>>> np.__version__
'1.19.2'
>>> fone = np.array(1.0, dtype=np.float32)
>>> fzer = np.array(0.0, dtype=np.float32)
>>> np.floor_divide(fone, fzer)
<stdin>:1: RuntimeWarning: invalid value encountered in floor_divide
nan
>>> I gdb'd into the code, specifically |
Nothing to do with |
@ganesh-k13, I created a new pr #18178 that adds fast integer division intrinsics for all SIMD extensions, it should be merged before this pr. |
Thanks, @seiko2plus , I'll rebase once that's merged and add the dispatches 👍 |
gh-18178 is merged |
2a44372
to
dba8572
Compare
Thanks, Matti, I have rebased to use the latest code, I'll fix errors and port it to signed as well. |
798458c
to
1f3a0f5
Compare
The UT passes except for one case, here is a trace. Any pointers will be helpful:
For masked array, |
I would be surprised if this is limited to masked arrays, they shouldn't really do anything special aside from making a copy and filling some value into masked parts. |
Oh I see. There must be a gap in the test plan as other cases are passing for some reason. Yeah, the const in src might be the culprit, will see what can be done. Thanks for the info |
1a4eb1b
to
812a9aa
Compare
812a9aa
to
0717ae1
Compare
0717ae1
to
7d37f7e
Compare
New Bench, deleting old. Dispatch:########### EXT COMPILER OPTIMIZATION ########### Platform : Architecture: x64 Compiler : gcc CPU baseline : Requested : 'min' Enabled : SSE SSE2 SSE3 Flags : -msse -msse2 -msse3 Extra checks: none CPU dispatch : Requested : 'max -xop -fma4' Enabled : SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL Generated : : SSE41 : SSE SSE2 SSE3 SSSE3 Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 Extra checks: none Detect : SSE SSE2 SSE3 SSSE3 SSE41 : numpy/core/src/umath/_umath_tests.dispatch.c : numpy/core/src/umath/loops_arithmetic.dispatch.c : SSE42 : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 Extra checks: none Detect : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 : numpy/core/src/_simd/_simd.dispatch.c : AVX2 : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mavx2 Extra checks: none Detect : AVX F16C AVX2 : numpy/core/src/umath/_umath_tests.dispatch.c : numpy/core/src/umath/loops_arithm_fp.dispatch.c : numpy/core/src/umath/loops_arithmetic.dispatch.c : (FMA3 AVX2) : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 Extra checks: none Detect : AVX F16C FMA3 AVX2 : numpy/core/src/_simd/_simd.dispatch.c : numpy/core/src/umath/loops_exponent_log.dispatch.c : numpy/core/src/umath/loops_trigonometric.dispatch.c : AVX512F : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mavx512f Extra checks: AVX512F_REDUCE Detect : AVX512F : numpy/core/src/_simd/_simd.dispatch.c : numpy/core/src/umath/loops_arithm_fp.dispatch.c : numpy/core/src/umath/loops_arithmetic.dispatch.c : numpy/core/src/umath/loops_exponent_log.dispatch.c : numpy/core/src/umath/loops_trigonometric.dispatch.c : AVX512_SKX : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq Extra checks: AVX512BW_MASK AVX512DQ_MASK Detect : AVX512_SKX : numpy/core/src/_simd/_simd.dispatch.c : numpy/core/src/umath/loops_arithmetic.dispatch.c : numpy/core/src/umath/loops_exponent_log.dispatch.c
|
7d37f7e
to
a2c5af9
Compare
Hey @seiko2plus , any more tests/changes needed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just one thing left.
@ganesh-k13, My apologies for the delayed response, here another benchmark that covers other archs. X86CPUArchitecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
Stepping: 10
CPU MHz: 1800.344
CPU max MHz: 4000.0000
CPU min MHz: 400.0000
BogoMIPS: 3984.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx OSLinux seiko-pc 5.8.0-48-generic #54-Ubuntu SMP Fri Mar 19 14:25:20 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu 10.2.0-13ubuntu1) 10.2.0 BenchmarkAVX2python runtests.py --bench-compare parent/main time_floor_divide_int before after ratio
[623bc1fa] [a2c5af9c]
<enh_simd_npyv_floor_div>
- 22.3±0.4μs 3.31±0.07μs 0.15 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
- 22.5±0.4μs 3.28±0.05μs 0.15 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
- 70.9±1μs 7.58±0.07μs 0.11 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)
- 74.1±3μs 7.72±0.2μs 0.10 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)
- 24.8±0.4μs 2.10±0.03μs 0.08 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
- 21.8±0.5μs 1.80±0.04μs 0.08 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)
- 25.5±0.6μs 2.05±0.06μs 0.08 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
- 22.2±0.2μs 1.77±0.05μs 0.08 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8) SSE41export NPY_DISABLE_CPU_FEATURES="AVX2"
python runtests.py --bench-compare parent/main time_floor_divide_int before after ratio
[623bc1fa] [a2c5af9c]
<enh_simd_npyv_floor_div>
- 22.9±0.4μs 4.50±0.2μs 0.20 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
- 23.7±1μs 4.38±0.06μs 0.18 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
- 70.5±1μs 11.6±0.3μs 0.16 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)
- 77.8±3μs 11.5±0.2μs 0.15 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)
- 22.5±0.2μs 2.05±0.2μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)
- 25.2±0.4μs 2.29±0.05μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
- 22.3±0.8μs 1.98±0.04μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
- 25.2±0.6μs 2.19±0.03μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43) SSE3export NPY_DISABLE_CPU_FEATURES="SSE41 AVX2"
python runtests.py --bench-compare parent/main time_floor_divide_int before after ratio
[623bc1fa] [a2c5af9c]
<enh_simd_npyv_floor_div>
- 22.8±0.5μs 4.82±0.1μs 0.21 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
- 22.9±1μs 4.78±0.1μs 0.21 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
- 72.8±3μs 12.2±0.6μs 0.17 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)
- 71.7±1μs 12.0±0.5μs 0.17 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)
- 25.7±1μs 2.43±0.1μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
- 25.2±0.3μs 2.38±0.05μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
- 22.6±0.5μs 2.11±0.04μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
- 22.5±0.4μs 2.10±0.03μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43) Power little-endianCPU
OSLinux 8b2db3b0dfac 4.19.0-2-powerpc64le
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2) BenchmarkVSX2python runtests.py --bench-compare parent/main time_floor_divide_int before after ratio
[623bc1fa] [46b8cfc3]
<enh_simd_npyv_floor_div>
- 16.9±0.02μs 10.1±0.01μs 0.59 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
- 16.3±0.03μs 7.25±0.01μs 0.44 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
- 16.4±0.05μs 7.24±0.01μs 0.44 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
- 25.9±0.03μs 10.1±0.01μs 0.39 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
- 16.6±0.03μs 5.56±0.01μs 0.33 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
- 16.6±0.03μs 5.55±0.02μs 0.33 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43) AArch64CPUArchitecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
CPU max MHz: 2314.0000
CPU min MHz: 403.0000
BogoMIPS: 52.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid OSLinux localhost 4.14.113-seiko_fastboot #30 SMP PREEMPT Wed Dec 30 12:28:43 IST 2020 aarch64 aarch64 aarch64 GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 BenchmarkNEONpython runtests.py --bench-compare parent/main time_floor_divide_int before after ratio
[036f6c68] [f3eb831d]
<enh_simd_npyv_floor_div>
- 28.9±0.08μs 19.9±0.04μs 0.69 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
- 28.6±0.08μs 12.2±0.03μs 0.43 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
- 52.3±0.05μs 20.0±0.04μs 0.38 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
- 34.6±0.08μs 12.1±0.03μs 0.35 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
- 28.0±0.06μs 8.05±0.04μs 0.29 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
- 28.0±0.05μs 8.04±0.05μs 0.29 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43) |
Co-authored-by: Sayed Adel <seiko@imavr.com>
Thanks a lot @seiko2plus! |
Also can I raise a PR for signed types now, @seiko2plus ? I'll purge libdivide where needed. |
Thanks @ganesh-k13 |
@ganesh-k13, sure you can, thank you! |
The change the change in the loops caused a small regression for:
conveniently found by pandas in pandas-dev/pandas#40874 I think all that is needed is to ensure the loop order is not change in the code generation chunk (unsigned loops before signed ones). And a test to make sure we are the ones who notice it next time would be nice ;). I can create a PR, but if you beat me to it, even better :). EDIT: The current and new loops are as follows (So I got it wrong above, it is signed before unsigned – of identical precision):
New loops:
|
Hey Sebastian, sorry for the regression. I have made the change locally and able to fix with your RCA. I am adding tests and will raise a PR shortly. |
BUG: Regression #18075 | Fixing Ufunc TD generation order
Dispatch for unsigned floor division
Hi @seiko2plus / @Qiyu8 , I am attempting to add fast integer division using universal intrinsics. I wanted your opinion on my approach.
I have marked much of the hardcoded parts with
// XXX
(everything is for16bit
and onlysigned
). Now I am able to understand the dispatch mechanism to an extent and hit the code paths through dispatch(not through below diff, I hardcoded fewrun_binary_simd_*
for that). But there are few things that do not work with integer types like load and store.We can add stuff in memory.h, but wanted your opinion on the load-store part. Also in general am I on the right path?cc: @seberg