
ENH, SIMD: Replace libdivide functions of signed integer division with universal intrinsics #18766


Merged: 10 commits into numpy:main on May 21, 2021

Conversation

ganesh-k13
Member

@ganesh-k13 ganesh-k13 commented Apr 13, 2021

Dispatch for signed floor division

  1. Added dispatch for signed floor divide.
  2. Added floor logic on top of trunc. Based on "Division by Invariant Integers using Multiplication" by T. Granlund and P. L. Montgomery.
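
As a rough illustration of the two points above, here is a scalar Python sketch of truncating division by a precomputed multiplier, followed by the floor correction. It follows the general shape of the signed algorithm in the Granlund and Montgomery paper, but it is not the code in this PR: the function names and the `bits` parameter are hypothetical, and Python's unbounded integers stand in for the high-multiply instructions a real SIMD kernel would use.

```python
def precompute(d, bits=32):
    """Return (multiplier, total_shift, divisor_sign) for a nonzero divisor d.

    Hypothetical helper: models a `bits`-wide signed lane with Python ints.
    """
    assert d != 0
    l = max((abs(d) - 1).bit_length(), 1)        # ceil(log2(|d|)), at least 1
    m = 1 + (1 << (bits + l - 1)) // abs(d)      # "magic" multiplier
    return m, bits + l - 1, (-1 if d < 0 else 0)


def trunc_div(n, params):
    """Quotient rounded toward zero, using a multiply and shifts only."""
    m, shift, dsign = params
    q = (m * n) >> shift          # floor(m * n / 2**shift); >> is arithmetic
    q += 1 if n < 0 else 0        # floor -> truncation for negative dividends
    return (q ^ dsign) - dsign    # negate the result when the divisor is negative


def floor_div(n, d, params):
    """Floor division built on top of the truncated quotient."""
    q = trunc_div(n, params)
    # Floor and trunc differ only when the signs differ and a remainder exists.
    if (n < 0) != (d < 0) and q * d != n:
        q -= 1
    return q
```

The precomputation is done once per scalar divisor, which is why the approach pays off when one divisor is applied to a whole array.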

Continues work of #18178 and #18075.

TODO items:

- [ ] Modify timedelta code where libdivide is used.
- [ ] Purge libdivide. Are any license-related changes needed?

CC: @seiko2plus , @seberg

Benchmarks

X86

CPU
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               142
Model name:          Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
Stepping:            10
CPU MHz:             1800.344
CPU max MHz:         4000.0000
CPU min MHz:         400.0000
BogoMIPS:            3984.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx
OS
Linux seiko-pc 5.8.0-48-generic #54-Ubuntu SMP Fri Mar 19 14:25:20 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu 10.2.0-13ubuntu1) 10.2.0

Benchmark

AVX2
python runtests.py --bench-compare parent/main time_floor_divide_int
  before           after         ratio
[3a61a14b]       [f505827a]
<enh_simd_signed_division~10>     <enh_simd_signed_division>
  105±0.07μs      23.9±0.06μs     0.23  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
     109±3μs      23.7±0.04μs     0.22  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
   135±0.6μs       23.9±0.8μs     0.18  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
   136±0.6μs      23.5±0.05μs     0.17  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
  97.5±0.5μs      8.89±0.08μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
  98.1±0.5μs       8.88±0.2μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
     116±3μs      8.82±0.01μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
     118±2μs      8.90±0.09μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
   142±0.7μs      5.04±0.08μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
     142±1μs       4.92±0.2μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
   154±0.5μs       4.86±0.2μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
   154±0.4μs      4.84±0.01μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
   152±0.2μs       4.75±0.3μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
     149±7μs      4.62±0.03μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
     146±7μs       4.52±0.1μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
     152±5μs       4.45±0.1μs     0.03  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)
SSE41
export NPY_DISABLE_CPU_FEATURES="AVX2"
python runtests.py --bench-compare parent/main time_floor_divide_int
 before           after         ratio
[623bc1fa]       [a2c5af9c]
<enh_simd_signed_division~10>       <enh_simd_signed_division>
 106±0.4μs       57.3±0.2μs     0.54  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
   106±4μs       56.5±0.3μs     0.53  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
 135±0.7μs         57.9±2μs     0.43  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
   140±4μs       57.4±0.6μs     0.41  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
97.9±0.3μs         14.7±1μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
   102±9μs         14.7±2μs     0.14  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
   115±2μs         13.8±1μs     0.12  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
   118±2μs       13.6±0.2μs     0.12  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
   141±1μs      6.15±0.04μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
   149±6μs       6.47±0.2μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
   148±6μs      6.33±0.05μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
   147±6μs      6.14±0.02μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
   156±2μs       6.44±0.5μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
   157±2μs       6.46±0.3μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
 152±0.6μs       6.19±0.1μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
   152±1μs      6.15±0.05μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)
SSE3
export NPY_DISABLE_CPU_FEATURES="SSE41 AVX2"
python runtests.py --bench-compare parent/main time_floor_divide_int
 before           after         ratio
[623bc1fa]       [a2c5af9c]
 105±0.3μs       56.2±0.2μs     0.54  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
   113±9μs       56.2±0.2μs     0.50  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
   140±4μs       56.2±0.2μs     0.40  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
   143±4μs       56.5±0.1μs     0.39  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
   103±5μs      17.0±0.08μs     0.17  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
   107±4μs      17.3±0.07μs     0.16  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
 116±0.7μs      17.1±0.08μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
   117±1μs      17.0±0.08μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
   156±6μs         7.13±4μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
   148±7μs       6.76±0.4μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
   144±2μs      6.41±0.05μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
   150±4μs      6.51±0.09μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
   157±2μs       6.69±0.1μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
  174±30μs         7.39±2μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
 152±0.8μs      6.41±0.03μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
   153±2μs      6.43±0.06μs     0.04  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)

Power little-endian

CPU
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor   : 7
cpu     : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)

timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix

OS
Linux 8b2db3b0dfac 4.19.0-2-powerpc64le
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2) 

Benchmark

VSX2
python runtests.py --bench-compare parent/main time_floor_divide_int
  before           after         ratio
[290a0345]       [2c393340]
<main>           <enh_simd_signed_division>
 64.2±0.2μs       60.8±0.2μs     0.95  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
 64.3±0.2μs       60.7±0.2μs     0.94  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
 64.5±0.2μs       60.0±0.2μs     0.93  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
 64.5±0.2μs       59.9±0.2μs     0.93  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
 66.3±0.2μs      15.2±0.01μs     0.23  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
 66.3±0.3μs       15.1±0.7μs     0.23  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
 66.3±0.1μs      15.1±0.01μs     0.23  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
 66.4±0.2μs      15.1±0.02μs     0.23  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
66.0±0.07μs      9.86±0.02μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
66.2±0.08μs      9.84±0.01μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
72.3±0.09μs      9.84±0.03μs     0.14  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
72.5±0.09μs      9.85±0.02μs     0.14  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
 64.7±0.3μs      6.82±0.02μs     0.11  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
 65.0±0.2μs      6.83±0.01μs     0.11  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)
 70.5±0.2μs      6.83±0.01μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
 70.9±0.2μs      6.83±0.01μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)

AArch64

CPU
Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           2
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         2314.0000
CPU min MHz:         403.0000
BogoMIPS:            52.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
OS
Linux localhost 4.14.113-seiko_fastboot #30 SMP PREEMPT Wed Dec 30 12:28:43 IST 2020 aarch64 aarch64 aarch64 GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

NEON
python runtests.py --bench-compare parent/main time_floor_divide_int
  before           after         ratio
[3a61a14b]       [f505827a]
<enh_simd_signed_division~10>       <enh_simd_signed_division>
     109±3μs         98.2±2μs     0.90  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
     112±3μs         97.7±2μs     0.87  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
     117±1μs       89.6±0.7μs     0.77  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
     117±2μs         89.2±1μs     0.76  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
     107±3μs       30.9±0.6μs     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
     110±3μs       31.2±0.6μs     0.28  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
     111±1μs       31.1±0.5μs     0.28  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
   112±0.2μs       31.1±0.6μs     0.28  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
     104±3μs       17.7±0.3μs     0.17  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
     106±2μs       17.9±0.5μs     0.17  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
   107±0.8μs       17.6±0.4μs     0.16  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
     111±3μs       17.9±0.4μs     0.16  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
     104±3μs       10.8±0.2μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
     107±1μs       10.8±0.2μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
     112±3μs       10.8±0.3μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
     113±1μs       10.8±0.2μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)

Initial benchmark (taken before the latest modifications, and run on different x86 hardware than the one above)
Basic Benchmarks:
 
CPU dispatch  :
  Requested   : 'max -xop -fma4'
  Enabled     : SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
  Generated   :

=================================

       before           after         ratio
     [635b3ad8]       [cf41b9c4]
     <main>           <enh_simd_signed_division>
-      33.9±0.5μs      9.84±0.03μs     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
-      33.9±0.5μs      9.82±0.03μs     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
-      40.3±0.4μs       9.82±0.1μs     0.24  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
-      40.2±0.4μs      9.78±0.07μs     0.24  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
-        28.0±1μs      4.55±0.03μs     0.16  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
-        28.2±1μs      4.55±0.02μs     0.16  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
-      42.3±0.4μs      4.59±0.01μs     0.11  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
-      42.3±0.3μs      4.58±0.02μs     0.11  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
-        30.7±1μs      2.44±0.01μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
-      32.7±0.3μs      2.43±0.02μs     0.07  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
-      34.6±0.8μs      2.53±0.01μs     0.07  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
-      34.9±0.3μs      2.51±0.02μs     0.07  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
-      45.6±0.4μs      2.49±0.07μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
-      45.7±0.1μs      2.41±0.01μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)
-      51.3±0.7μs      2.53±0.03μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
-      51.6±0.3μs      2.51±0.01μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
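
For reading the tables above: the ratio column is after/before, so a ratio of 0.23 corresponds to roughly a 4.4x speedup. A small sketch for turning one result row into a speedup factor follows; the helper names are hypothetical, and the parser assumes the `value±error unit` layout shown in this PR.

```python
import re

# Accept both the Greek mu and the micro sign, since either can appear in output.
UNIT_SECONDS = {"ns": 1e-9, "\u03bcs": 1e-6, "\u00b5s": 1e-6, "ms": 1e-3, "s": 1.0}
TIME_FIELD = re.compile(r"([\d.]+)±[\d.]+(ns|[\u03bc\u00b5]s|ms|s)")


def timings(line):
    """Return the (before, after) timings of one result row, in seconds."""
    fields = TIME_FIELD.findall(line)[:2]
    before, after = (float(value) * UNIT_SECONDS[unit] for value, unit in fields)
    return before, after


def speedup(line):
    """How many times faster the 'after' column is than the 'before' column."""
    before, after = timings(line)
    return before / after
```

For example, passing the first AVX2 int64 row (105μs before, 23.9μs after) gives a factor of about 4.4.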

@seiko2plus seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Apr 16, 2021
@Qiyu8
Member

Qiyu8 commented Apr 16, 2021

Each dispatched cpu feature should have a corresponding benchmark.

@ganesh-k13 ganesh-k13 force-pushed the enh_simd_signed_division branch from 5e9589c to f96211e Compare April 21, 2021 16:31
@charris
Member

charris commented Apr 22, 2021

close/reopen

@charris charris closed this Apr 22, 2021
@charris charris reopened this Apr 22, 2021
@ganesh-k13 ganesh-k13 force-pushed the enh_simd_signed_division branch from e2e6908 to 4dc28da Compare May 2, 2021 13:13
@ganesh-k13
Member Author

ganesh-k13 commented May 11, 2021

What's the difference between TIMEDELTA_mq_m_divide and TIMEDELTA_mm_q_floor_divide? Asking because TIMEDELTA_mm_q_floor_divide is not being used. Also, it seems to be a #define here:

#define TIMEDELTA_mq_m_floor_divide TIMEDELTA_mq_m_divide

EDIT: My bad, I see that one of them is mm, and that denotes the input types.

@ganesh-k13 ganesh-k13 force-pushed the enh_simd_signed_division branch from 4dc28da to 7651c58 Compare May 11, 2021 16:09
@ganesh-k13
Member Author

@seiko2plus what say we keep this PR only for signed types? I'll take up timedelta in a new PR, as it's slightly different, with a need for a NaT check, and I'll also add benchmarks there, grouping them in a single PR (along with the removal of libdivide). Let me know your thoughts on this.
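
To illustrate why timedelta needs separate handling, here is a minimal scalar sketch of the NaT guard. NumPy stores NaT as the minimum int64 value, so it has to be filtered out before any arithmetic. The function name is hypothetical, and both the division-by-zero result and the rounding mode of the fast path are assumptions made for this sketch rather than a description of the actual loop.

```python
INT64_MIN = -(2 ** 63)
NAT = INT64_MIN  # NaT ("not a time") is stored as the minimum int64 value


def timedelta_floor_div(ticks, divisor):
    """Hypothetical scalar model of dividing raw timedelta ticks by an integer."""
    # The sentinel check comes first: NaT is a marker, not a number to divide.
    # Assumption: a zero divisor also produces NaT.
    if ticks == NAT or divisor == 0:
        return NAT
    # Assumption: Python floor division stands in for the vectorised fast path.
    return ticks // divisor
```

A plain signed-integer kernel has no such sentinel, which is one reason the two cases do not share a loop unchanged.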

@seiko2plus
Member

@ganesh-k13,

what say we keep this PR only for signed types

I think it's fine to handle the remaining work in another PR.

@mattip
Member

mattip commented May 14, 2021

@seiko2plus, @Qiyu8 still have open "changes requested", among them to re-run the relevant benchmarks.

@seiko2plus seiko2plus force-pushed the enh_simd_signed_division branch from f505827 to b5193d7 Compare May 20, 2021 21:20
  - Revert unsigned integer division changes,
    overflow check only required by signed division.
  - Fix floor round for reduce division
  - cleanup
  - revert fixme comment
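
The note above that the overflow check is only required by signed division can be checked by brute force on a small width. In two's complement, the only quotient that does not fit is the minimum value divided by -1; unsigned division can never grow past the dividend. A sketch with hypothetical helper names:

```python
def signed_range(bits):
    """Smallest and largest values of a two's-complement integer of this width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def signed_div_overflows(a, b, bits):
    """True for the single signed case whose quotient is not representable."""
    lo, _ = signed_range(bits)
    return a == lo and b == -1


def overflowing_pairs(bits):
    """Exhaustively list (dividend, divisor) pairs whose quotient is out of range."""
    lo, hi = signed_range(bits)
    return [
        (a, b)
        for a in range(lo, hi + 1)
        for b in range(lo, hi + 1)
        if b != 0 and not lo <= a // b <= hi
    ]
```

For 8-bit values, `overflowing_pairs(8)` contains only `(-128, -1)`.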
@seiko2plus seiko2plus force-pushed the enh_simd_signed_division branch from b5193d7 to 4619081 Compare May 20, 2021 21:24
  When the divisor is equal to the minimum integer value,
  it was affected by gcc 9.3 only and under certain conditions
  of aggressive optimization.
@seiko2plus seiko2plus force-pushed the enh_simd_signed_division branch from e4f575d to f74f500 Compare May 21, 2021 03:24
@seiko2plus seiko2plus changed the title ENH: SIMD signed division dispatch ENH, SIMD: Replace libdivide functions of signed integer division with universal intrinsics May 21, 2021
@seiko2plus
Member

I have made several fixes/improvements explained in the latest commit log and have also deleted the overflow-test patch since it's already included by PR #19046.

@seiko2plus
Member

New benchmark results have been added to the PR description.

Member

@seiko2plus seiko2plus left a comment


LGTM, Thank you Ganesh.

@mattip mattip merged commit 7de0fa9 into numpy:main May 21, 2021
@mattip
Member

mattip commented May 21, 2021

Thanks @ganesh-k13, @seiko2plus. The speedups are quite nice.

@ganesh-k13
Member Author

Thanks a lot, @seiko2plus!

@seberg
Member

seberg commented Oct 20, 2021

Does anyone here have time to look into gh-20025? It seems like a pretty bad bug, and we should make sure to fix it before the 1.22 release at least.

Labels
01 - Enhancement component: SIMD Issues in SIMD (fast instruction sets) code or machinery

6 participants