ENH: libdivide for unsigned integers #18055

ganesh-k13 · 2020-12-22T11:07:35Z

Changes:

Added libdivide for unsinged types
Added bench and UT for above

Benchmarks:

 
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython
·· Installing e1cb8503  into virtualenv-py3.7-Cython
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For numpy commit 11cba182  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 25.00%] ··· Running (bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int--).
[ 25.00%] · For numpy commit e1cb8503  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 50.00%] ··· Running (bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int--).
[ 50.00%] · For numpy commit e1cb8503  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 75.00%] ··· ...omScalarFloorDivideInt.time_floor_divide_int                 ok
[ 75.00%] ··· ============== ========== =============
                  dtype       divisors               
              -------------- ---------- -------------
                numpy.int8       8       1.54±0.01μs 
                numpy.int8       -8        1.52±0μs  
                numpy.int8       43      1.68±0.01μs 
                numpy.int8      -43      1.67±0.01μs 
                numpy.int8       0       2.28±0.02μs 
               numpy.int16       8       1.60±0.01μs 
               numpy.int16       -8      1.58±0.02μs 
               numpy.int16       43      1.66±0.02μs 
               numpy.int16      -43      1.69±0.05μs 
               numpy.int16       0       2.38±0.04μs 
               numpy.int32       8       1.60±0.01μs 
               numpy.int32       -8      1.61±0.01μs 
               numpy.int32       43      1.69±0.01μs 
               numpy.int32      -43      1.71±0.01μs 
               numpy.int32       0       2.42±0.01μs 
               numpy.int64       8       1.53±0.01μs 
               numpy.int64       -8      1.52±0.01μs 
               numpy.int64       43      1.59±0.02μs 
               numpy.int64      -43      1.59±0.01μs 
               numpy.int64       0       2.29±0.02μs 
               numpy.uint8       8       1.32±0.01μs 
               numpy.uint8       -8      1.72±0.01μs 
               numpy.uint8       43      1.33±0.02μs 
               numpy.uint8      -43      1.78±0.05μs 
               numpy.uint8       0       2.29±0.04μs 
               numpy.uint16      8       1.37±0.03μs 
               numpy.uint16      -8      1.74±0.02μs 
               numpy.uint16      43      1.37±0.02μs 
               numpy.uint16     -43      1.82±0.02μs 
               numpy.uint16      0       2.35±0.06μs 
               numpy.uint32      8       1.43±0.03μs 
               numpy.uint32      -8      1.65±0.03μs 
               numpy.uint32      43      1.44±0.03μs 
               numpy.uint32     -43      1.69±0.03μs 
               numpy.uint32      0       2.44±0.04μs 
               numpy.uint64      8       1.47±0.03μs 
               numpy.uint64      -8      3.04±0.03μs 
               numpy.uint64      43      1.49±0.01μs 
               numpy.uint64     -43      2.81±0.03μs 
               numpy.uint64      0       2.45±0.05μs 
              ============== ========== =============

[ 75.00%] ···· For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor

[ 75.00%] · For numpy commit 11cba182  (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[100.00%] ··· ...omScalarFloorDivideInt.time_floor_divide_int                 ok
[100.00%] ··· ============== ========== =============
                  dtype       divisors               
              -------------- ---------- -------------
                numpy.int8       8         1.53±0μs  
                numpy.int8       -8      1.54±0.02μs 
                numpy.int8       43      1.65±0.01μs 
                numpy.int8      -43      1.66±0.01μs 
                numpy.int8       0       2.26±0.04μs 
               numpy.int16       8       1.59±0.01μs 
               numpy.int16       -8      1.62±0.01μs 
               numpy.int16       43      1.61±0.03μs 
               numpy.int16      -43      1.68±0.02μs 
               numpy.int16       0       2.36±0.04μs 
               numpy.int32       8       1.63±0.02μs 
               numpy.int32       -8      1.61±0.01μs 
               numpy.int32       43        1.68±0μs  
               numpy.int32      -43      1.70±0.01μs 
               numpy.int32       0       2.42±0.05μs 
               numpy.int64       8       1.51±0.02μs 
               numpy.int64       -8      1.49±0.01μs 
               numpy.int64       43      1.58±0.01μs 
               numpy.int64      -43      1.58±0.02μs 
               numpy.int64       0       2.27±0.02μs 
               numpy.uint8       8       1.51±0.02μs 
               numpy.uint8       -8        1.73±0μs  
               numpy.uint8       43      1.51±0.04μs 
               numpy.uint8      -43      1.80±0.01μs 
               numpy.uint8       0       2.44±0.02μs 
               numpy.uint16      8       1.62±0.02μs 
               numpy.uint16      -8      1.77±0.03μs 
               numpy.uint16      43      1.60±0.01μs 
               numpy.uint16     -43        1.84±0μs  
               numpy.uint16      0       2.48±0.02μs 
               numpy.uint32      8       1.64±0.01μs 
               numpy.uint32      -8      1.63±0.02μs 
               numpy.uint32      43      1.64±0.02μs 
               numpy.uint32     -43      1.69±0.01μs 
               numpy.uint32      0       2.51±0.01μs 
               numpy.uint64      8       1.68±0.02μs 
               numpy.uint64      -8      3.01±0.05μs 
               numpy.uint64      43      1.68±0.01μs 
               numpy.uint64     -43      2.82±0.02μs 
               numpy.uint64      0       2.57±0.01μs 
              ============== ========== =============

[100.00%] ···· For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor

       before           after         ratio
     [11cba182]       [e1cb8503]
              
-     2.48±0.02μs      2.35±0.06μs     0.95  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 0)
-     2.44±0.02μs      2.29±0.04μs     0.94  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 0)
-     1.68±0.01μs      1.49±0.01μs     0.88  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 43)
-     1.64±0.02μs      1.44±0.03μs     0.88  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 43)
-     1.51±0.04μs      1.33±0.02μs     0.88  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 43)
-     1.68±0.02μs      1.47±0.03μs     0.88  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 8)
-     1.51±0.02μs      1.32±0.01μs     0.87  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 8)
-     1.64±0.01μs      1.43±0.03μs     0.87  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 8)
-     1.60±0.01μs      1.37±0.02μs     0.85  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 43)
-     1.62±0.02μs      1.37±0.03μs     0.85  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 8)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
********************************************************************************
WARNING: you have uncommitted changes --- these will NOT be benchmarked!
********************************************************************************

Note: I saw unsigned ints was missed when I was working on scalar fastpaths, so added it here.

Continues: #17727

ganesh-k13 · 2020-12-22T11:08:28Z

numpy/core/tests/test_umath.py

@@ -926,7 +931,7 @@ def test_log_values(self):
            assert_raises(FloatingPointError, np.log, np.float32(-np.inf))
            assert_raises(FloatingPointError, np.log, np.float32(-1.0))

-        # See https://github.com/numpy/numpy/issues/18005 
+        # See https://github.com/numpy/numpy/issues/18005


My .vimrc autocorrected this, can undo and rebase if required, please let me know

Don't bother, if it isn't your editor, the next persons editor will jus do it.. We could try to do this in a dedicated STY commit, but honest, I don't think we are good about that for small stuff, and I am not sure it is worth it.

seberg

Overall, looks good to me, simple repetition of the pattern for signed ints, thanks.

Just some small comments/question and I am slightly worried about long long being larger than 64bits on some platforms. I am not aware that exists, but maybe we should just put a #error in anyway, to be sure?

seberg · 2020-12-23T19:27:10Z

numpy/core/tests/test_umath.py

@@ -926,7 +931,7 @@ def test_log_values(self):
            assert_raises(FloatingPointError, np.log, np.float32(-np.inf))
            assert_raises(FloatingPointError, np.log, np.float32(-1.0))

-        # See https://github.com/numpy/numpy/issues/18005 
+        # See https://github.com/numpy/numpy/issues/18005


Don't bother, if it isn't your editor, the next persons editor will jus do it.. We could try to do this in a dedicated STY commit, but honest, I don't think we are good about that for small stuff, and I am not sure it is worth it.

seberg · 2020-12-23T21:08:52Z

benchmarks/benchmarks/bench_ufunc.py

-    max_value = 10**7
-    min_value = -10**7
+    max_value = 10**2
+    min_value = -10**2


How come this change now?

I noticed in my previous PR, that small values itself show the changes properly, so I reduced it to make it memory and time friendly. I can make it 7 also, please let me know

seberg · 2020-12-23T21:09:39Z

numpy/core/src/umath/loops.c.src

- * #type = npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong#
- * #c    = u,u,u,ul,ull#
+ * #TYPE  = UBYTE, USHORT, UINT, ULONG, ULONGLONG#
+ * #TYPE2 = BYTE, SHORT, INT, LONG, LONGLONG#


Might name it SIGNED_TYPE? just a thought.

Thanks yeah, really did not know what to call it :)

seberg · 2020-12-23T21:21:07Z

numpy/core/src/umath/loops.c.src

+#define libdivide_@type@_t libdivide_u64_t
+#define libdivide_@type@_gen libdivide_u64_gen
+#define libdivide_@type@_do libdivide_u64_do
+#endif


Just realized, if there is ever hardware with long long > 64bit we are going to have a problem. If there is hardware like that, we have to fix this in the 1.20 release. If we are not worried about such existence, we still should probably anticipate it, or at least put a #error "Uhoh!" in here.

Sounds good, thanks. I'll see if I can test it.

seberg · 2020-12-23T21:28:07Z

Interesting that the speed improvement is much smaller than for the signed versions, makes me almost wonder if libdivide even does much, or it is mostly the loop specialization for stride == 0. But since it is fairly simple and identical to the signed loop, I don't mind doing this even if the speedup is only moderate.

seiko2plus · 2020-12-23T22:48:31Z

numpy/core/src/umath/loops.c.src

-        const @type@ in1 = *(@type@ *)ip1;
+    BINARY_DEFS
+
+    /* When the divisor is a constant, use libdivide for faster division */


I don't think libdivide can beat modern compilers, have you tried to specialize scalar divisor if (steps[1] == 0) under optimization level 3 without libdivide?

Have you tried to test it on another architectures too? have you tried to unroll the loop by four scalars per each iteration?

NOTE: Arm & IBM/Power have decent instruction-sets for scalar divison even better than the x86 ones idiv, div.

If you still think libdivide can perform better than native hardware support then maybe you should try Agner Fog implementation, it has a decent emulated SSE/AVX intrinsics show better performance than x86 scalar instruction-set when the divisor is scalar. see vectori128.h, vectori256.h, and vectori512.h.

Hmmm, I think we tried for the signed division, but it may be we forgot to ensure optimization level 3 is used. We should do that. Otherwise, I would be fine with doing this as a baseline for better approaches, which I am sure exist and would be nice to have. But if you think this is just churn, we can also just not do it.

Thanks for the info @seiko2plus , I agree that SSE and AVX will give a performance boost that is better than raw libdivide. What I was hoping to do is that we can get a baseline version as Sebastian mentioned above, and then expose the intrinsics later. ref: #17925. I'll try optimisations of constant loops without libdivide yeah.

I checked libdivide docs & source code and it seems they never test it on Power, my main concern here is ppc64 since it has direct native support for unsigned and signed 64-bit & 32-bit division. also, Agner Fog didn't see a need to emulate SSE/AVX intrinsics for the 64-bit division, he only supports 8-bit, 16-bit, and 32-bit.

we can get a baseline version as Sebastian mentioned above

for what reason? almost all machines nowadays have SIMD support. I think we can just bring Agner Fog integer division intrinsics to NPYV, and removes libdivide. what do you think?

Hey @seiko2plus , will be happy to try out Agner Fog's VCL for unsigned here and then port to signed if it's good. I noticed all the implementations are in CPP, apologies If it's a basic question: Is it possible to use a CPP lib here? Or do we build some sort of wrapper?

Is it possible to use a CPP lib here? Or do we build some sort of wrapper?

neither both, we gonna have re-implement them for example on SSE
you can implements 16-bit division as following:

// encapsulate parameters for fast division on vector of 8 16-bit signed integers NPY_INLINE npyv_s16x3 npyv_divisor_s16(npy_int16 a) { const int d1 = abs(d); int sh, m; if (d1 > 1) { // TODO: implment npy_bit_scan_reverse_u32 sh = (int)npy_bit_scan_reverse_u32(d1-1); // shift count = ceil(log2(d1))-1 = (bit_scan_reverse(d1-1)+1)-1 m = ((1 << (16 + sh)) / d1 - ((1 << 16) - 1)); // calculate multiplier } else { m = 1; // for d1 = 1 sh = 0; if (d == 0) { m /= d; // provoke error here if d = 0 } if (d == 0x8000) { // fix overflow for this special case m = 0x8001; sh = 14; } } npyv_s16x3 divisor; divisor.val[0] = _mm_set1_epi16((npy_int16)m); // broadcast multiplier divisor.val[1] = _mm_setr_epi32(sh, 0, 0, 0); // shift count divisor.val[2] = _mm_set1_epi32(d < 0 ? -1 : 0); // sign of divisor return divisor; } NPY_FINLINE npyv_s16 npyv_divc_s16(npyv_s16 a, const npyv_s16x3 divisor) { __m128i t1 = _mm_mulhi_epi16(a, divisor.val[0]); // multiply high signed words __m128i t2 = _mm_add_epi16(t1, a); // + a __m128i t3 = _mm_sra_epi16(t2, divisor.val[1]); // shift right arithmetic __m128i t4 = _mm_srai_epi16(a, 15); // sign of a __m128i t5 = _mm_sub_epi16(t4, divisor.val[2]); // sign of a - sign of d __m128i t6 = _mm_sub_epi16(t3, t5); // + 1 if a < 0, -1 if d < 0 return _mm_xor_si128(t6, divisor.val[2]); // change sign if divisor negative }

Its almost the same code of VLC but pretty fit to NPYV. I can help you to with the VSX & NEON if you're not familiar with it.
take a look at #17790 to see how we implments new intrinsics, for the runtime dispatcher see #17985 or #17587(minmal one).

Thanks a lot for the detailed info @seiko2plus , I shall start work on this 👍

ganesh-k13 · 2020-12-24T03:55:52Z

Interesting that the speed improvement is much smaller than for the signed versions

What must be noted is the cause of speedup in signed versions:

Added libdivide for const loops.
Removal of % operator.

Now each change contributed to a near equal amount of speedup. But in case of unsigned, there is no % to begin with.

ganesh-k13 · 2021-02-27T07:21:02Z

Using SIMD in #18178

ganesh-k13 added 3 commits December 22, 2020 16:01

ENH: Added libdivide for unsigned types

cd6d2e4

TST: Added uints test for true divide

e1cb850

BENCH: Added true divide benchmarks for uints

a075013

github-actions bot added the 01 - Enhancement label Dec 22, 2020

ganesh-k13 commented Dec 22, 2020

View reviewed changes

seberg reviewed Dec 23, 2020

View reviewed changes

seiko2plus reviewed Dec 23, 2020

View reviewed changes

ganesh-k13 closed this Feb 27, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: libdivide for unsigned integers #18055

ENH: libdivide for unsigned integers #18055

ganesh-k13 commented Dec 22, 2020

ganesh-k13 Dec 22, 2020

seberg Dec 23, 2020

seberg left a comment

seberg Dec 23, 2020

seberg Dec 23, 2020

ganesh-k13 Dec 24, 2020

seberg Dec 23, 2020

ganesh-k13 Dec 24, 2020

seberg Dec 23, 2020

ganesh-k13 Dec 24, 2020

seberg commented Dec 23, 2020

seiko2plus Dec 23, 2020 •

edited

Loading

seberg Dec 23, 2020

ganesh-k13 Dec 24, 2020

seiko2plus Dec 25, 2020 •

edited

Loading

ganesh-k13 Dec 25, 2020

seiko2plus Dec 26, 2020 •

edited

Loading

ganesh-k13 Dec 26, 2020

ganesh-k13 commented Dec 24, 2020

ganesh-k13 commented Feb 27, 2021

ENH: libdivide for unsigned integers #18055

ENH: libdivide for unsigned integers #18055

Conversation

ganesh-k13 commented Dec 22, 2020

Changes:

Benchmarks:

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seberg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seberg commented Dec 23, 2020

seiko2plus Dec 23, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seiko2plus Dec 25, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seiko2plus Dec 26, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ganesh-k13 commented Dec 24, 2020

ganesh-k13 commented Feb 27, 2021

seiko2plus Dec 23, 2020 •

edited

Loading

seiko2plus Dec 25, 2020 •

edited

Loading

seiko2plus Dec 26, 2020 •

edited

Loading