Skip to content

ENH, SIMD: Dispatch for unsigned floor division #18075

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Apr 6, 2021

Conversation

ganesh-k13
Copy link
Member

@ganesh-k13 ganesh-k13 commented Dec 27, 2020

Dispatch for unsigned floor division

Hi @seiko2plus / @Qiyu8 , I am attempting to add fast integer division using universal intrinsics. I wanted your opinion on my approach.

I have marked much of the hardcoded parts with // XXX(everything is for 16bitand only signed). Now I am able to understand the dispatch mechanism to an extent and hit the code paths through dispatch(not through below diff, I hardcoded few run_binary_simd_* for that). But there are few things that do not work with integer types like load and store.

We can add stuff in memory.h, but wanted your opinion on the load-store part. Also in general am I on the right path?

cc: @seberg

@ganesh-k13 ganesh-k13 marked this pull request as draft December 27, 2020 10:48
@ganesh-k13 ganesh-k13 changed the title ENH,SIMD: [DRAFT] Added fast divide for SSE WIP, ENH, SIMD: [DRAFT] Added fast divide for SSE Dec 27, 2020
@seiko2plus
Copy link
Member

seiko2plus commented Dec 27, 2020

@ganesh-k13,

We can add stuff in memory.h, but wanted your opinion on the load-store part. Also in general am I on the right path?

There's no need add any new intrinsics for memory operations, please could you share your SIMD kernel code?

EDIT: sorry I saw it, my bad.

@ganesh-k13 ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 79fada9 to 607fd92 Compare December 27, 2020 18:19
@ganesh-k13
Copy link
Member Author

Thanks @seiko2plus for all the info, I have addressed a few of them. Will do the rest in a while.

@ganesh-k13
Copy link
Member Author

@seiko2plus , any idea why Agner did not go for 64 bit? Should we also just do normal built-in division for 64?

@seiko2plus
Copy link
Member

any idea why Agner did not go for 64 bit?

I guess not worth it. unroll, specialize a loop under opt level 3 would be enough.

@ganesh-k13
Copy link
Member Author

ganesh-k13 commented Jan 7, 2021

I have implemented the SSE versions, but I notice UT failing locally for float. How float?

>>> import numpy as np
>>> np.__version__
'0.3.0+24365.g2a4437284'
>>> fone = np.array(1.0, dtype=np.float32)
>>> fzer = np.array(0.0, dtype=np.float32)
>>> np.floor_divide(fone, fzer)
<stdin>:1: RuntimeWarning: divide by zero encountered in floor_divide
inf
>>>
>>> import numpy as np
>>> np.__version__
'1.19.2'
>>> fone = np.array(1.0, dtype=np.float32)
>>> fzer = np.array(0.0, dtype=np.float32)
>>> np.floor_divide(fone, fzer)
<stdin>:1: RuntimeWarning: invalid value encountered in floor_divide
nan
>>>

I gdb'd into the code, specifically FLOAT_divide, that in turn goes into loops_arithm_fp.dispatch.c.src, which I have not touched and don't see any error being set either. Any pointers here will be very helpful before I proceed to AVX

@seiko2plus
Copy link
Member

@ganesh-k13,

I gdb'd into the code, specifically FLOAT_divide, that in turn goes into loops_arithm_fp.dispatch.c.src

Nothing to do with loops_arithm_fp.dispatch.c.src, FLOAT_divide only performs true divide.
However, this issue has been fixed by #16161 since if non-zero divided by zero should return inf.

@seiko2plus seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Jan 10, 2021
@seiko2plus seiko2plus self-assigned this Jan 10, 2021
@seiko2plus
Copy link
Member

@ganesh-k13, I created a new pr #18178 that adds fast integer division intrinsics for all SIMD extensions, it should be merged before this pr.

@ganesh-k13
Copy link
Member Author

Thanks, @seiko2plus , I'll rebase once that's merged and add the dispatches 👍

Base automatically changed from master to main March 4, 2021 02:05
@mattip
Copy link
Member

mattip commented Mar 8, 2021

gh-18178 is merged

@ganesh-k13 ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 2a44372 to dba8572 Compare March 8, 2021 08:53
@ganesh-k13
Copy link
Member Author

Thanks, Matti, I have rebased to use the latest code, I'll fix errors and port it to signed as well.

@ganesh-k13 ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 798458c to 1f3a0f5 Compare March 10, 2021 14:35
@ganesh-k13
Copy link
Member Author

The UT passes except for one case, here is a trace. Any pointers will be helpful:

(Pdb) p a                                                                                      
masked_array(data=[0, 1, 2, 3, 4, 5, 6, 7, 8, --],                                             
             mask=[False, False, False, False, False, False, False, False,                     
                   False,  True],                                                                                                                                                             
       fill_value=999999,                                                                      
            dtype=uint8)
(Pdb) z                                                                                        
masked_array(data=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],                                              
             mask=False,                                                                       
       fill_value=999999,                                                                      
            dtype=uint8)                                                                       
(Pdb) z//a                                     
masked_array(data=[--, 0, 0, 0, 0, 0, 0, 0, 0, --],                                            
             mask=[ True, False, False, False, False, False, False, False,                     
                   False,  True],                                                              
       fill_value=999999,                                                                      
            dtype=uint8)                                                                       
(Pdb) z//=a                                                                                    
(Pdb) p z                                                                                      
masked_array(data=[--, 1, 1, 1, 1, 1, 1, 1, 1, --],                                            
             mask=[ True, False, False, False, False, False, False, False,                     
                   False,  True],                                                              
       fill_value=999999,                                                                      
            dtype=uint8) 

For masked array, // and //= acts differently, anything special needs to be handled here?

@seberg
Copy link
Member

seberg commented Mar 12, 2021

I would be surprised if this is limited to masked arrays, they shouldn't really do anything special aside from making a copy and filling some value into masked parts.
Trye the same for z._data and a._data? If inplace breaks, my first thought is that there is a problem accounting for src == dst (or in ufunc terms args[0] == args[2])? const char *src might give the wrong idea in that sense, but probably there is more to it.

@ganesh-k13
Copy link
Member Author

Oh I see. There must be a gap in the test plan as other cases are passing for some reason. Yeah, the const in src might be the culprit, will see what can be done. Thanks for the info

@ganesh-k13 ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 1a4eb1b to 812a9aa Compare March 20, 2021 11:29
@ganesh-k13 ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 812a9aa to 0717ae1 Compare March 20, 2021 11:32
@ganesh-k13 ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 0717ae1 to 7d37f7e Compare March 22, 2021 15:23
@ganesh-k13
Copy link
Member Author

ganesh-k13 commented Mar 22, 2021

New Bench, deleting old.

Dispatch:
 
########### EXT COMPILER OPTIMIZATION ###########
Platform      : 
  Architecture: x64
  Compiler    : gcc

CPU baseline  : 
  Requested   : 'min'
  Enabled     : SSE SSE2 SSE3
  Flags       : -msse -msse2 -msse3
  Extra checks: none

CPU dispatch  : 
  Requested   : 'max -xop -fma4'
  Enabled     : SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
  Generated   : 
              : 
  SSE41       : SSE SSE2 SSE3 SSSE3
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1
  Extra checks: none
  Detect      : SSE SSE2 SSE3 SSSE3 SSE41
              : numpy/core/src/umath/_umath_tests.dispatch.c
              : numpy/core/src/umath/loops_arithmetic.dispatch.c
              : 
  SSE42       : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2
  Extra checks: none
  Detect      : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42
              : numpy/core/src/_simd/_simd.dispatch.c
              : 
  AVX2        : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mavx2
  Extra checks: none
  Detect      : AVX F16C AVX2
              : numpy/core/src/umath/_umath_tests.dispatch.c
              : numpy/core/src/umath/loops_arithm_fp.dispatch.c
              : numpy/core/src/umath/loops_arithmetic.dispatch.c
              : 
  (FMA3 AVX2) : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2
  Extra checks: none
  Detect      : AVX F16C FMA3 AVX2
              : numpy/core/src/_simd/_simd.dispatch.c
              : numpy/core/src/umath/loops_exponent_log.dispatch.c
              : numpy/core/src/umath/loops_trigonometric.dispatch.c
              : 
  AVX512F     : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mavx512f
  Extra checks: AVX512F_REDUCE
  Detect      : AVX512F
              : numpy/core/src/_simd/_simd.dispatch.c
              : numpy/core/src/umath/loops_arithm_fp.dispatch.c
              : numpy/core/src/umath/loops_arithmetic.dispatch.c
              : numpy/core/src/umath/loops_exponent_log.dispatch.c
              : numpy/core/src/umath/loops_trigonometric.dispatch.c
              : 
  AVX512_SKX  : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq
  Extra checks: AVX512BW_MASK AVX512DQ_MASK
  Detect      : AVX512_SKX
              : numpy/core/src/_simd/_simd.dispatch.c
              : numpy/core/src/umath/loops_arithmetic.dispatch.c
              : numpy/core/src/umath/loops_exponent_log.dispatch.c

       before           after         ratio                                                                                                                                                   
     [2ccc7942]       [c78d9a0b]                                                                                                                                                              
     <main>           <enh_simd_npyv_floor_div>                                                                                                                                               
-      33.1±0.5μs       30.6±0.2μs     0.93  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)                                                           
-       103±0.2μs      8.36±0.06μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)                                                         
-       106±0.4μs      8.35±0.04μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)                                                          
-      64.6±0.2μs      4.19±0.04μs     0.06  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)                                                         
-      67.3±0.2μs      4.21±0.02μs     0.06  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)                                                          
-      34.0±0.4μs      2.06±0.01μs     0.06  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)                                                          
-      36.4±0.4μs      2.08±0.01μs     0.06  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)                                                           
-      45.3±0.1μs      2.49±0.05μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)                                                         
-      48.0±0.6μs      2.48±0.03μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)                                                          
                                                                                                                                                                                              
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.                                                                                                                                                   
PERFORMANCE INCREASED.                                 

@ganesh-k13 ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 7d37f7e to a2c5af9 Compare March 22, 2021 15:28
@ganesh-k13
Copy link
Member Author

ganesh-k13 commented Mar 31, 2021

Hey @seiko2plus , any more tests/changes needed?

Copy link
Member

@seiko2plus seiko2plus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one thing left.

@seiko2plus
Copy link
Member

@ganesh-k13, My apologies for the delayed response, here another benchmark that covers other archs.

X86

CPU
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               142
Model name:          Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
Stepping:            10
CPU MHz:             1800.344
CPU max MHz:         4000.0000
CPU min MHz:         400.0000
BogoMIPS:            3984.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx
OS
Linux seiko-pc 5.8.0-48-generic #54-Ubuntu SMP Fri Mar 19 14:25:20 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu 10.2.0-13ubuntu1) 10.2.0

Benchmark

AVX2
python runtests.py --bench-compare parent/main time_floor_divide_int
     before           after         ratio
     [623bc1fa]       [a2c5af9c]
                      <enh_simd_npyv_floor_div>
-      22.3±0.4μs      3.31±0.07μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
-      22.5±0.4μs      3.28±0.05μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
-        70.9±1μs      7.58±0.07μs     0.11  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)
-        74.1±3μs       7.72±0.2μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)
-      24.8±0.4μs      2.10±0.03μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
-      21.8±0.5μs      1.80±0.04μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)
-      25.5±0.6μs      2.05±0.06μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
-      22.2±0.2μs      1.77±0.05μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
SSE41
export NPY_DISABLE_CPU_FEATURES="AVX2"
python runtests.py --bench-compare parent/main time_floor_divide_int
     before           after         ratio
     [623bc1fa]       [a2c5af9c]
                      <enh_simd_npyv_floor_div>
-      22.9±0.4μs       4.50±0.2μs     0.20  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
-        23.7±1μs      4.38±0.06μs     0.18  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
-        70.5±1μs       11.6±0.3μs     0.16  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)
-        77.8±3μs       11.5±0.2μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)
-      22.5±0.2μs       2.05±0.2μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)
-      25.2±0.4μs      2.29±0.05μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
-      22.3±0.8μs      1.98±0.04μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
-      25.2±0.6μs      2.19±0.03μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
SSE3
export NPY_DISABLE_CPU_FEATURES="SSE41 AVX2"
python runtests.py --bench-compare parent/main time_floor_divide_int
     before           after         ratio
     [623bc1fa]       [a2c5af9c]
                      <enh_simd_npyv_floor_div>
-      22.8±0.5μs       4.82±0.1μs     0.21  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
-        22.9±1μs       4.78±0.1μs     0.21  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
-        72.8±3μs       12.2±0.6μs     0.17  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)
-        71.7±1μs       12.0±0.5μs     0.17  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)
-        25.7±1μs       2.43±0.1μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
-      25.2±0.3μs      2.38±0.05μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
-      22.6±0.5μs      2.11±0.04μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
-      22.5±0.4μs      2.10±0.03μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)

Power little-endian

CPU
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor   : 7
cpu     : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)

timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix

OS
Linux 8b2db3b0dfac 4.19.0-2-powerpc64le
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2) 

Benchmark

VSX2
python runtests.py --bench-compare parent/main time_floor_divide_int
       before           after         ratio
     [623bc1fa]       [46b8cfc3]
                      <enh_simd_npyv_floor_div>
-     16.9±0.02μs      10.1±0.01μs     0.59  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
-     16.3±0.03μs      7.25±0.01μs     0.44  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
-     16.4±0.05μs      7.24±0.01μs     0.44  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
-     25.9±0.03μs      10.1±0.01μs     0.39  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
-     16.6±0.03μs      5.56±0.01μs     0.33  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
-     16.6±0.03μs      5.55±0.02μs     0.33  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)

AArch64

CPU
Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           2
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         2314.0000
CPU min MHz:         403.0000
BogoMIPS:            52.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
OS
Linux localhost 4.14.113-seiko_fastboot #30 SMP PREEMPT Wed Dec 30 12:28:43 IST 2020 aarch64 aarch64 aarch64 GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

NEON
python runtests.py --bench-compare parent/main time_floor_divide_int
       before           after         ratio
     [036f6c68]       [f3eb831d]
                      <enh_simd_npyv_floor_div>
-     28.9±0.08μs      19.9±0.04μs     0.69  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
-     28.6±0.08μs      12.2±0.03μs     0.43  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
-     52.3±0.05μs      20.0±0.04μs     0.38  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
-     34.6±0.08μs      12.1±0.03μs     0.35  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
-     28.0±0.06μs      8.05±0.04μs     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
-     28.0±0.05μs      8.04±0.05μs     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)

Co-authored-by: Sayed Adel <seiko@imavr.com>
@ganesh-k13
Copy link
Member Author

Thanks a lot @seiko2plus!

@ganesh-k13
Copy link
Member Author

Also can I raise a PR for signed types now, @seiko2plus ? I'll purge libdivide where needed.

@mattip mattip merged commit 914407d into numpy:main Apr 6, 2021
@mattip
Copy link
Member

mattip commented Apr 6, 2021

Thanks @ganesh-k13

@seiko2plus
Copy link
Member

@ganesh-k13, sure you can, thank you!

@seberg
Copy link
Member

seberg commented Apr 11, 2021

The change the change in the loops caused a small regression for:

import operator
import numpy as np

result = operator.__floordiv__(np.array([True, False, False]), True)

conveniently found by pandas in pandas-dev/pandas#40874

I think all that is needed is to ensure the loop order is not change in the code generation chunk (unsigned loops before signed ones). And a test to make sure we are the ones who notice it next time would be nice ;). I can create a PR, but if you beat me to it, even better :).

EDIT: The current and new loops are as follows (So I got it wrong above, it is signed before unsigned – of identical precision):

In [1]: np.floor_divide.types
Out[1]: 
['bb->b',
 'BB->B',
 'hh->h',
 'HH->H',
 'ii->i',
 'II->I',
 'll->l',
 'LL->L',
 'qq->q',
 'QQ->Q',
 'ee->e',
 'ff->f',
 'dd->d',
 'gg->g',
 'FF->F',
 'DD->D',
 'GG->G',
 'mq->m',
 'md->m',
 'mm->q',
 'OO->O']

New loops:

In [1]: np.floor_divide.types
Out[1]: 
['BB->B',
 'HH->H',
 'II->I',
 'LL->L',
 'QQ->Q',
 'bb->b',
 'hh->h',
 'ii->i',
 'll->l',
 'qq->q',
 'ee->e',
 'ff->f',
 'dd->d',
 'gg->g',
 'FF->F',
 'DD->D',
 'GG->G',
 'mq->m',
 'md->m',
 'mm->q',
 'OO->O']

@ganesh-k13
Copy link
Member Author

Hey Sebastian, sorry for the regression. I have made the change locally and able to fix with your RCA. I am adding tests and will raise a PR shortly.

ganesh-k13 added a commit to ganesh-k13/numpy that referenced this pull request Apr 11, 2021
charris added a commit that referenced this pull request Apr 12, 2021
BUG: Regression #18075 | Fixing Ufunc TD generation order
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
01 - Enhancement 25 - WIP component: SIMD Issues in SIMD (fast instruction sets) code or machinery
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants