ENH, SIMD: Replace libdivide functions of signed integer division with universal intrinsics #18766

ganesh-k13 · 2021-04-13T16:11:07Z

Dispatch for signed floor division

Added dispatch for signed floor divide.
Added floor logic on top of truncated division, based on "Division by Invariant Integers using Multiplication" by T. Granlund and P. L. Montgomery.
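
For readers unfamiliar with the technique, here is a rough Python sketch of the idea: a truncating quotient computed with a precomputed multiplier and a shift (in the spirit of the signed algorithm from the Granlund and Montgomery paper), followed by a floor correction. This is an illustration only, not the code in this PR. The names `magic_params`, `trunc_div` and `floor_div` are made up, and it leans on Python's unbounded integers instead of the fixed-width high-multiply a SIMD loop would use.

```python
def magic_params(d, nbits):
    """Precompute (multiplier, shift) for a fixed nonzero divisor d.

    Hypothetical helper for illustration; not a NumPy function.
    """
    ad = abs(d)
    # l = max(ceil(log2(|d|)), 1)
    l = max((ad - 1).bit_length(), 1)
    shift = nbits + l - 1
    m = 1 + (1 << shift) // ad
    return m, shift


def trunc_div(n, d, nbits=16):
    """Quotient of n / d rounded toward zero, without a division by d
    in the per-element step. n must fit in a signed nbits integer."""
    m, shift = magic_params(d, nbits)
    q = (m * n) >> shift          # floor(m * n / 2**shift)
    if n < 0:
        q += 1                    # turn the floor into a truncation
    return -q if d < 0 else q     # apply the sign of the divisor


def floor_div(n, d, nbits=16):
    """Floor division built on top of the truncating quotient."""
    q = trunc_div(n, d, nbits)
    r = n - q * d                 # remainder of the truncating division
    # Truncation and floor differ only when the remainder is nonzero
    # and its sign differs from the divisor's sign.
    if r != 0 and (r < 0) != (d < 0):
        q -= 1
    return q
```

For example, `floor_div(-7, 2)` gives `-4`, whereas `trunc_div(-7, 2)` gives `-3`. In a real loop the multiplier and shift would be computed once per divisor and reused for every element.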

Continues work of #18178 and #18075.

TODO items:

~~- [ ] Modify timedelta code where libdivide is used,~~
~~- [ ] Purge process of libdivide. Any license related changes needed(?)~~

CC: @seiko2plus , @seberg

Benchmarks

X86

CPU

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               142
Model name:          Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
Stepping:            10
CPU MHz:             1800.344
CPU max MHz:         4000.0000
CPU min MHz:         400.0000
BogoMIPS:            3984.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx

OS

Linux seiko-pc 5.8.0-48-generic #54-Ubuntu SMP Fri Mar 19 14:25:20 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu 10.2.0-13ubuntu1) 10.2.0

Benchmark

AVX2

python runtests.py --bench-compare parent/main time_floor_divide_int

  before           after         ratio
[3a61a14b]       [f505827a]
<enh_simd_signed_division~10>     <enh_simd_signed_division>
  105±0.07μs      23.9±0.06μs     0.23  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
     109±3μs      23.7±0.04μs     0.22  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
   135±0.6μs       23.9±0.8μs     0.18  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
   136±0.6μs      23.5±0.05μs     0.17  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
  97.5±0.5μs      8.89±0.08μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
  98.1±0.5μs       8.88±0.2μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
     116±3μs      8.82±0.01μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
     118±2μs      8.90±0.09μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
   142±0.7μs      5.04±0.08μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
     142±1μs       4.92±0.2μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
   154±0.5μs       4.86±0.2μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
   154±0.4μs      4.84±0.01μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
   152±0.2μs       4.75±0.3μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
     149±7μs      4.62±0.03μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
     146±7μs       4.52±0.1μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
     152±5μs       4.45±0.1μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)
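
As a reading aid for the tables: the ratio column is the "after" time divided by the "before" time, so smaller is better. The snippet below is a small sketch that recomputes it from the printed cells; `parse_us` and `ratio` are hypothetical helpers, and because the printed times are already rounded, the result may differ slightly from what the benchmark tool reports.

```python
import re


def parse_us(cell):
    """Extract the mean time in microseconds from a cell like '105±0.07μs'."""
    match = re.match(r"\s*([0-9.]+)±", cell)
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    return float(match.group(1))


def ratio(before, after):
    """after / before, as shown in the ratio column."""
    return parse_us(after) / parse_us(before)


def speedup(before, after):
    """How many times faster the 'after' run is."""
    return parse_us(before) / parse_us(after)
```

With the first AVX2 row, `ratio("105±0.07μs", "23.9±0.06μs")` rounds to 0.23, which corresponds to a speedup of roughly 4.4x.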

SSE41

export NPY_DISABLE_CPU_FEATURES="AVX2"
python runtests.py --bench-compare parent/main time_floor_divide_int

 before           after         ratio
[623bc1fa]       [a2c5af9c]
<enh_simd_signed_division~10>       <enh_simd_signed_division>
 106±0.4μs       57.3±0.2μs     0.54  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
   106±4μs       56.5±0.3μs     0.53  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
 135±0.7μs         57.9±2μs     0.43  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
   140±4μs       57.4±0.6μs     0.41  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
97.9±0.3μs         14.7±1μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
   102±9μs         14.7±2μs     0.14  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
   115±2μs         13.8±1μs     0.12  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
   118±2μs       13.6±0.2μs     0.12  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
   141±1μs      6.15±0.04μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
   149±6μs       6.47±0.2μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
   148±6μs      6.33±0.05μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
   147±6μs      6.14±0.02μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
   156±2μs       6.44±0.5μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
   157±2μs       6.46±0.3μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
 152±0.6μs       6.19±0.1μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
   152±1μs      6.15±0.05μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)

SSE3

export NPY_DISABLE_CPU_FEATURES="SSE41 AVX2"
python runtests.py --bench-compare parent/main time_floor_divide_int

 before           after         ratio
[623bc1fa]       [a2c5af9c]
 105±0.3μs       56.2±0.2μs     0.54  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
   113±9μs       56.2±0.2μs     0.50  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
   140±4μs       56.2±0.2μs     0.40  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
   143±4μs       56.5±0.1μs     0.39  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
   103±5μs      17.0±0.08μs     0.17  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
   107±4μs      17.3±0.07μs     0.16  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
 116±0.7μs      17.1±0.08μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
   117±1μs      17.0±0.08μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
   156±6μs         7.13±4μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
   148±7μs       6.76±0.4μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
   144±2μs      6.41±0.05μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
   150±4μs      6.51±0.09μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
   157±2μs       6.69±0.1μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
  174±30μs         7.39±2μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
 152±0.8μs      6.41±0.03μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
   153±2μs      6.43±0.06μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)

Power little-endian

CPU

Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor   : 7
cpu     : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)

timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix

OS

Linux 8b2db3b0dfac 4.19.0-2-powerpc64le
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)

Benchmark

VSX2

python runtests.py --bench-compare parent/main time_floor_divide_int

  before           after         ratio
[290a0345]       [2c393340]
<main>           <enh_simd_signed_division>
 64.2±0.2μs       60.8±0.2μs     0.95  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
 64.3±0.2μs       60.7±0.2μs     0.94  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
 64.5±0.2μs       60.0±0.2μs     0.93  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
 64.5±0.2μs       59.9±0.2μs     0.93  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
 66.3±0.2μs      15.2±0.01μs     0.23  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
 66.3±0.3μs       15.1±0.7μs     0.23  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
 66.3±0.1μs      15.1±0.01μs     0.23  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
 66.4±0.2μs      15.1±0.02μs     0.23  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
66.0±0.07μs      9.86±0.02μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
66.2±0.08μs      9.84±0.01μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
72.3±0.09μs      9.84±0.03μs     0.14  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
72.5±0.09μs      9.85±0.02μs     0.14  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
 64.7±0.3μs      6.82±0.02μs     0.11  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
 65.0±0.2μs      6.83±0.01μs     0.11  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)
 70.5±0.2μs      6.83±0.01μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
 70.9±0.2μs      6.83±0.01μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)

AArch64

CPU

Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           2
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         2314.0000
CPU min MHz:         403.0000
BogoMIPS:            52.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

OS

Linux localhost 4.14.113-seiko_fastboot #30 SMP PREEMPT Wed Dec 30 12:28:43 IST 2020 aarch64 aarch64 aarch64 GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

NEON

python runtests.py --bench-compare parent/main time_floor_divide_int

  before           after         ratio
[3a61a14b]       [f505827a]
<enh_simd_signed_division~10>       <enh_simd_signed_division>
     109±3μs         98.2±2μs     0.90  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
     112±3μs         97.7±2μs     0.87  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
     117±1μs       89.6±0.7μs     0.77  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
     117±2μs         89.2±1μs     0.76  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
     107±3μs       30.9±0.6μs     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
     110±3μs       31.2±0.6μs     0.28  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
     111±1μs       31.1±0.5μs     0.28  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
   112±0.2μs       31.1±0.6μs     0.28  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
     104±3μs       17.7±0.3μs     0.17  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
     106±2μs       17.9±0.5μs     0.17  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
   107±0.8μs       17.6±0.4μs     0.16  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
     111±3μs       17.9±0.4μs     0.16  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
     104±3μs       10.8±0.2μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
     107±1μs       10.8±0.2μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
     112±3μs       10.8±0.3μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
     113±1μs       10.8±0.2μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)

Initial benchmark (run before the latest modifications, on different x86 hardware than the one above)

Basic Benchmarks:

 
CPU dispatch  :
  Requested   : 'max -xop -fma4'
  Enabled     : SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
  Generated   :
=================================
       before           after         ratio
     [635b3ad8]       [cf41b9c4]
     <main>           <enh_simd_signed_division>
-      33.9±0.5μs      9.84±0.03μs     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
-      33.9±0.5μs      9.82±0.03μs     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
-      40.3±0.4μs       9.82±0.1μs     0.24  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
-      40.2±0.4μs      9.78±0.07μs     0.24  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
-        28.0±1μs      4.55±0.03μs     0.16  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
-        28.2±1μs      4.55±0.02μs     0.16  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
-      42.3±0.4μs      4.59±0.01μs     0.11  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
-      42.3±0.3μs      4.58±0.02μs     0.11  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
-        30.7±1μs      2.44±0.01μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
-      32.7±0.3μs      2.43±0.02μs     0.07  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
-      34.6±0.8μs      2.53±0.01μs     0.07  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
-      34.9±0.3μs      2.51±0.02μs     0.07  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
-      45.6±0.4μs      2.49±0.07μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
-      45.7±0.1μs      2.41±0.01μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)
-      51.3±0.7μs      2.53±0.03μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
-      51.6±0.3μs      2.51±0.01μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

numpy/core/src/umath/loops_arithmetic.dispatch.c.src

Qiyu8 · 2021-04-16T09:16:02Z

Each dispatched cpu feature should have a corresponding benchmark.

numpy/core/src/umath/loops_arithmetic.dispatch.c.src

charris · 2021-04-22T17:08:57Z

close/reopen

ganesh-k13 · 2021-05-11T14:18:27Z

What's the difference between TIMEDELTA_mq_m_divide and TIMEDELTA_mm_q_floor_divide? I ask because TIMEDELTA_mm_q_floor_divide is not being used. Also, it seems to be a #define here:

numpy/numpy/core/src/umath/loops.h.src

Line 612 in 99b396b

#define TIMEDELTA_mq_m_floor_divide TIMEDELTA_mq_m_divide

EDIT: My bad, I saw that one of them is mm, and that refers to the input types.

ganesh-k13 · 2021-05-11T16:12:47Z

@seiko2plus what say we keep this PR only for signed types? I'll take up timedelta in a new PR, since it's slightly different (it needs a NaT check), and I'll add benchmarks there too, grouping everything in a single PR (along with the removal of libdivide). Let me know your thoughts on this.

seiko2plus · 2021-05-13T07:27:20Z

@ganesh-k13,

> what say we keep this PR only for signed types

I think it's fine to handle the remaining work in another PR.

mattip · 2021-05-14T10:14:29Z

@seiko2plus, @Qiyu8 still have open "changes requested", among them to re-run the relevant benchmarks.

seiko2plus · 2021-05-21T03:35:31Z

I have made several fixes/improvements explained in the latest commit log and have also deleted the overflow-test patch since it's already included by PR #19046.

seiko2plus · 2021-05-21T03:38:57Z

New benchmark results have been added to the PR description.

seiko2plus

LGTM, Thank you Ganesh.

mattip · 2021-05-21T13:27:23Z

Thanks @ganesh-k13, @seiko2plus. The speedups are quite nice.

ganesh-k13 · 2021-05-23T05:53:08Z

Thanks a lot, @seiko2plus!

seberg · 2021-10-20T18:27:39Z

Does anyone here have time to look into gh-20025? It seems like a pretty bad bug, and we should make sure to fix it before the 1.22 release at least.

github-actions bot added the 01 - Enhancement label Apr 13, 2021

ganesh-k13 commented Apr 13, 2021

seiko2plus requested changes Apr 16, 2021

seiko2plus reviewed Apr 16, 2021

seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Apr 16, 2021

Qiyu8 requested changes Apr 16, 2021

ganesh-k13 force-pushed the enh_simd_signed_division branch from 5e9589c to f96211e Compare April 21, 2021 16:31

seiko2plus requested changes Apr 21, 2021

charris closed this Apr 22, 2021

charris reopened this Apr 22, 2021

ganesh-k13 force-pushed the enh_simd_signed_division branch from e2e6908 to 4dc28da Compare May 2, 2021 13:13

ganesh-k13 force-pushed the enh_simd_signed_division branch from 4dc28da to 7651c58 Compare May 11, 2021 16:09

seiko2plus mentioned this pull request May 19, 2021

BUG, SIMD: Fix unexpected result of uint8 division on X86 #19046

Merged

seiko2plus force-pushed the enh_simd_signed_division branch from c056c42 to f505827 Compare May 19, 2021 18:33

ganesh-k13 added 8 commits May 20, 2021 23:19

SIMD: Removed umath code

0d3bd92

SIMD: Add signed to dispatch

b000d5d

SIMD: Add dispatch to generate_umath

9bb27e8

SIMD: Added dispatch code for signed

7f9d342

SIMD: Added floor divide logic for signed

0b8838e

DOC: Added floor divide doc

b1c3c98

SIMD: Refined signed and unsigned floor divide

b6b3267

SIMD: Separate signed and unsigned loops

7c16367

seiko2plus force-pushed the enh_simd_signed_division branch from f505827 to b5193d7 Compare May 20, 2021 21:20

MAINT, SIMD: Several fixes to integer division

4619081

- Revert unsigned integer division changes; the overflow check is only required by signed division.
- Fix floor rounding for reduce division
- cleanup
- revert fixme comment
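
To illustrate why only the signed case needs an overflow check, here is a small brute-force sketch over 8-bit integers. The helper names are hypothetical and this is not the check used in the PR; it only enumerates which quotients fall outside the representable range.

```python
def signed_overflow_pairs(nbits):
    """All (n, d) with d != 0 whose floor quotient does not fit
    in a signed nbits integer."""
    lo = -(1 << (nbits - 1))
    hi = (1 << (nbits - 1)) - 1
    return [
        (n, d)
        for n in range(lo, hi + 1)
        for d in range(lo, hi + 1)
        if d != 0 and not (lo <= n // d <= hi)
    ]


def unsigned_overflow_pairs(nbits):
    """All (n, d) with d != 0 whose quotient does not fit
    in an unsigned nbits integer."""
    hi = (1 << nbits) - 1
    return [
        (n, d)
        for n in range(hi + 1)
        for d in range(1, hi + 1)
        if n // d > hi
    ]
```

For 8 bits, `signed_overflow_pairs(8)` contains only `(-128, -1)`, the minimum value divided by -1, while `unsigned_overflow_pairs(8)` is empty.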

seiko2plus force-pushed the enh_simd_signed_division branch from b5193d7 to 4619081 Compare May 20, 2021 21:24

SIMD: Fix computing the fast int32 division parameters

f74f500

Fixes the case where the divisor is equal to the minimum integer value. Only gcc 9.3 was affected, and only under certain conditions of aggressive optimization.
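
The commit message does not spell out the root cause. One plausible reason the minimum value is a special case (an assumption, not something stated in this PR) is that its magnitude is not representable in the same signed type, so computing |d| has to happen in an unsigned type. The sketch below simulates 32-bit two's-complement wraparound in Python to show this; the helper names are hypothetical.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def wrap_int32(x):
    """Reduce x to the signed 32-bit range, as two's-complement wraparound would."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x


def abs_in_int32(d):
    """|d| if the result is forced back into a signed 32-bit value."""
    return wrap_int32(-d if d < 0 else d)


def abs_in_uint32(d):
    """|d| stored in an unsigned 32-bit value."""
    return (-d if d < 0 else d) & 0xFFFFFFFF
```

Here `abs_in_int32(INT32_MIN)` comes back as `INT32_MIN` itself, while `abs_in_uint32(INT32_MIN)` is `2**31` as expected. In C, negating the minimum signed value is undefined behaviour rather than a guaranteed wraparound, so the simulation only shows one possible outcome.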

seiko2plus force-pushed the enh_simd_signed_division branch from e4f575d to f74f500 Compare May 21, 2021 03:24

seiko2plus changed the title ~~ENH: SIMD signed division dispatch~~ ENH, SIMD: Replace libdivide functions of signed integer division with universal intrinsics May 21, 2021

seiko2plus approved these changes May 21, 2021

Qiyu8 approved these changes May 21, 2021

mattip merged commit 7de0fa9 into numpy:main May 21, 2021

mattip mentioned this pull request Aug 30, 2021

ENH: Add SIMD implementation for deg2rad #19779

Closed

seiko2plus mentioned this pull request Oct 23, 2021

BUG: inaccurate 8-bit unsigned integer division when the divisor is a scalar #20168

Closed

ganesh-k13 mentioned this pull request Jul 10, 2022

BUG: Better report integer division overflow #21507

Merged
