Skip to content

ENH: libdivide for unsigned integers #18055

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

ganesh-k13
Copy link
Member

Changes:

  1. Added libdivide for unsinged types
  2. Added bench and UT for above

Benchmarks:

Benchmarks:
 
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython
·· Installing e1cb8503  into virtualenv-py3.7-Cython
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For numpy commit 11cba182  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 25.00%] ··· Running (bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int--).
[ 25.00%] · For numpy commit e1cb8503  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 50.00%] ··· Running (bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int--).
[ 50.00%] · For numpy commit e1cb8503  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 75.00%] ··· ...omScalarFloorDivideInt.time_floor_divide_int                 ok
[ 75.00%] ··· ============== ========== =============
                  dtype       divisors               
              -------------- ---------- -------------
                numpy.int8       8       1.54±0.01μs 
                numpy.int8       -8        1.52±0μs  
                numpy.int8       43      1.68±0.01μs 
                numpy.int8      -43      1.67±0.01μs 
                numpy.int8       0       2.28±0.02μs 
               numpy.int16       8       1.60±0.01μs 
               numpy.int16       -8      1.58±0.02μs 
               numpy.int16       43      1.66±0.02μs 
               numpy.int16      -43      1.69±0.05μs 
               numpy.int16       0       2.38±0.04μs 
               numpy.int32       8       1.60±0.01μs 
               numpy.int32       -8      1.61±0.01μs 
               numpy.int32       43      1.69±0.01μs 
               numpy.int32      -43      1.71±0.01μs 
               numpy.int32       0       2.42±0.01μs 
               numpy.int64       8       1.53±0.01μs 
               numpy.int64       -8      1.52±0.01μs 
               numpy.int64       43      1.59±0.02μs 
               numpy.int64      -43      1.59±0.01μs 
               numpy.int64       0       2.29±0.02μs 
               numpy.uint8       8       1.32±0.01μs 
               numpy.uint8       -8      1.72±0.01μs 
               numpy.uint8       43      1.33±0.02μs 
               numpy.uint8      -43      1.78±0.05μs 
               numpy.uint8       0       2.29±0.04μs 
               numpy.uint16      8       1.37±0.03μs 
               numpy.uint16      -8      1.74±0.02μs 
               numpy.uint16      43      1.37±0.02μs 
               numpy.uint16     -43      1.82±0.02μs 
               numpy.uint16      0       2.35±0.06μs 
               numpy.uint32      8       1.43±0.03μs 
               numpy.uint32      -8      1.65±0.03μs 
               numpy.uint32      43      1.44±0.03μs 
               numpy.uint32     -43      1.69±0.03μs 
               numpy.uint32      0       2.44±0.04μs 
               numpy.uint64      8       1.47±0.03μs 
               numpy.uint64      -8      3.04±0.03μs 
               numpy.uint64      43      1.49±0.01μs 
               numpy.uint64     -43      2.81±0.03μs 
               numpy.uint64      0       2.45±0.05μs 
              ============== ========== =============

[ 75.00%] ···· For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor

[ 75.00%] · For numpy commit 11cba182  (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[100.00%] ··· ...omScalarFloorDivideInt.time_floor_divide_int                 ok
[100.00%] ··· ============== ========== =============
                  dtype       divisors               
              -------------- ---------- -------------
                numpy.int8       8         1.53±0μs  
                numpy.int8       -8      1.54±0.02μs 
                numpy.int8       43      1.65±0.01μs 
                numpy.int8      -43      1.66±0.01μs 
                numpy.int8       0       2.26±0.04μs 
               numpy.int16       8       1.59±0.01μs 
               numpy.int16       -8      1.62±0.01μs 
               numpy.int16       43      1.61±0.03μs 
               numpy.int16      -43      1.68±0.02μs 
               numpy.int16       0       2.36±0.04μs 
               numpy.int32       8       1.63±0.02μs 
               numpy.int32       -8      1.61±0.01μs 
               numpy.int32       43        1.68±0μs  
               numpy.int32      -43      1.70±0.01μs 
               numpy.int32       0       2.42±0.05μs 
               numpy.int64       8       1.51±0.02μs 
               numpy.int64       -8      1.49±0.01μs 
               numpy.int64       43      1.58±0.01μs 
               numpy.int64      -43      1.58±0.02μs 
               numpy.int64       0       2.27±0.02μs 
               numpy.uint8       8       1.51±0.02μs 
               numpy.uint8       -8        1.73±0μs  
               numpy.uint8       43      1.51±0.04μs 
               numpy.uint8      -43      1.80±0.01μs 
               numpy.uint8       0       2.44±0.02μs 
               numpy.uint16      8       1.62±0.02μs 
               numpy.uint16      -8      1.77±0.03μs 
               numpy.uint16      43      1.60±0.01μs 
               numpy.uint16     -43        1.84±0μs  
               numpy.uint16      0       2.48±0.02μs 
               numpy.uint32      8       1.64±0.01μs 
               numpy.uint32      -8      1.63±0.02μs 
               numpy.uint32      43      1.64±0.02μs 
               numpy.uint32     -43      1.69±0.01μs 
               numpy.uint32      0       2.51±0.01μs 
               numpy.uint64      8       1.68±0.02μs 
               numpy.uint64      -8      3.01±0.05μs 
               numpy.uint64      43      1.68±0.01μs 
               numpy.uint64     -43      2.82±0.02μs 
               numpy.uint64      0       2.57±0.01μs 
              ============== ========== =============

[100.00%] ···· For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor
               
               For parameters: , 0
               /home/ganesh/open-source/numpy/benchmarks/benchmarks/bench_ufunc.py:150: RuntimeWarning: divide by zero encountered in floor_divide
                 self.x // divisor

       before           after         ratio
     [11cba182]       [e1cb8503]
              
-     2.48±0.02μs      2.35±0.06μs     0.95  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 0)
-     2.44±0.02μs      2.29±0.04μs     0.94  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 0)
-     1.68±0.01μs      1.49±0.01μs     0.88  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 43)
-     1.64±0.02μs      1.44±0.03μs     0.88  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 43)
-     1.51±0.04μs      1.33±0.02μs     0.88  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 43)
-     1.68±0.02μs      1.47±0.03μs     0.88  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 8)
-     1.51±0.02μs      1.32±0.01μs     0.87  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 8)
-     1.64±0.01μs      1.43±0.03μs     0.87  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 8)
-     1.60±0.01μs      1.37±0.02μs     0.85  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 43)
-     1.62±0.02μs      1.37±0.03μs     0.85  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(, 8)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
********************************************************************************
WARNING: you have uncommitted changes --- these will NOT be benchmarked!
********************************************************************************

Note: I saw unsigned ints was missed when I was working on scalar fastpaths, so added it here.

Continues: #17727

@@ -926,7 +931,7 @@ def test_log_values(self):
assert_raises(FloatingPointError, np.log, np.float32(-np.inf))
assert_raises(FloatingPointError, np.log, np.float32(-1.0))

# See https://github.com/numpy/numpy/issues/18005
# See https://github.com/numpy/numpy/issues/18005
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My .vimrc autocorrected this, can undo and rebase if required, please let me know

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't bother, if it isn't your editor, the next persons editor will jus do it.. We could try to do this in a dedicated STY commit, but honest, I don't think we are good about that for small stuff, and I am not sure it is worth it.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good to me, simple repetition of the pattern for signed ints, thanks.

Just some small comments/question and I am slightly worried about long long being larger than 64bits on some platforms. I am not aware that exists, but maybe we should just put a #error in anyway, to be sure?

@@ -926,7 +931,7 @@ def test_log_values(self):
assert_raises(FloatingPointError, np.log, np.float32(-np.inf))
assert_raises(FloatingPointError, np.log, np.float32(-1.0))

# See https://github.com/numpy/numpy/issues/18005
# See https://github.com/numpy/numpy/issues/18005
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't bother, if it isn't your editor, the next persons editor will jus do it.. We could try to do this in a dedicated STY commit, but honest, I don't think we are good about that for small stuff, and I am not sure it is worth it.

max_value = 10**7
min_value = -10**7
max_value = 10**2
min_value = -10**2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come this change now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed in my previous PR, that small values itself show the changes properly, so I reduced it to make it memory and time friendly. I can make it 7 also, please let me know

* #type = npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong#
* #c = u,u,u,ul,ull#
* #TYPE = UBYTE, USHORT, UINT, ULONG, ULONGLONG#
* #TYPE2 = BYTE, SHORT, INT, LONG, LONGLONG#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might name it SIGNED_TYPE? just a thought.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks yeah, really did not know what to call it :)

#define libdivide_@type@_t libdivide_u64_t
#define libdivide_@type@_gen libdivide_u64_gen
#define libdivide_@type@_do libdivide_u64_do
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized, if there is ever hardware with long long > 64bit we are going to have a problem. If there is hardware like that, we have to fix this in the 1.20 release. If we are not worried about such existence, we still should probably anticipate it, or at least put a #error "Uhoh!" in here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thanks. I'll see if I can test it.

@seberg
Copy link
Member

seberg commented Dec 23, 2020

Interesting that the speed improvement is much smaller than for the signed versions, makes me almost wonder if libdivide even does much, or it is mostly the loop specialization for stride == 0. But since it is fairly simple and identical to the signed loop, I don't mind doing this even if the speedup is only moderate.

const @type@ in1 = *(@type@ *)ip1;
BINARY_DEFS

/* When the divisor is a constant, use libdivide for faster division */
Copy link
Member

@seiko2plus seiko2plus Dec 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think libdivide can beat modern compilers, have you tried to specialize scalar divisor if (steps[1] == 0) under optimization level 3 without libdivide?

Have you tried to test it on another architectures too? have you tried to unroll the loop by four scalars per each iteration?

NOTE: Arm & IBM/Power have decent instruction-sets for scalar divison even better than the x86 ones idiv, div.

If you still think libdivide can perform better than native hardware support then maybe you should try Agner Fog implementation, it has a decent emulated SSE/AVX intrinsics show better performance than x86 scalar instruction-set when the divisor is scalar. see vectori128.h, vectori256.h, and vectori512.h.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I think we tried for the signed division, but it may be we forgot to ensure optimization level 3 is used. We should do that. Otherwise, I would be fine with doing this as a baseline for better approaches, which I am sure exist and would be nice to have. But if you think this is just churn, we can also just not do it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info @seiko2plus , I agree that SSE and AVX will give a performance boost that is better than raw libdivide. What I was hoping to do is that we can get a baseline version as Sebastian mentioned above, and then expose the intrinsics later. ref: #17925. I'll try optimisations of constant loops without libdivide yeah.

Copy link
Member

@seiko2plus seiko2plus Dec 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked libdivide docs & source code and it seems they never test it on Power, my main concern here is ppc64 since it has direct native support for unsigned and signed 64-bit & 32-bit division. also, Agner Fog didn't see a need to emulate SSE/AVX intrinsics for the 64-bit division, he only supports 8-bit, 16-bit, and 32-bit.

we can get a baseline version as Sebastian mentioned above

for what reason? almost all machines nowadays have SIMD support. I think we can just bring Agner Fog integer division intrinsics to NPYV, and removes libdivide. what do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @seiko2plus , will be happy to try out Agner Fog's VCL for unsigned here and then port to signed if it's good. I noticed all the implementations are in CPP, apologies If it's a basic question: Is it possible to use a CPP lib here? Or do we build some sort of wrapper?

Copy link
Member

@seiko2plus seiko2plus Dec 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use a CPP lib here? Or do we build some sort of wrapper?

neither both, we gonna have re-implement them for example on SSE
you can implements 16-bit division as following:

// encapsulate parameters for fast division on vector of 8 16-bit signed integers
NPY_INLINE npyv_s16x3 npyv_divisor_s16(npy_int16 a)
{
    const int d1 = abs(d);
    int sh, m;
    if (d1 > 1) {
       // TODO: implment npy_bit_scan_reverse_u32
        sh = (int)npy_bit_scan_reverse_u32(d1-1); // shift count = ceil(log2(d1))-1 = (bit_scan_reverse(d1-1)+1)-1
        m = ((1 << (16 + sh)) / d1 - ((1 << 16) - 1)); // calculate multiplier
    }
    else {
        m = 1;                                                  // for d1 = 1
        sh = 0;
        if (d == 0) {
            m /= d;  // provoke error here if d = 0
        }
        if (d == 0x8000) { // fix overflow for this special case
            m = 0x8001;
            sh = 14;
        }
    }
    npyv_s16x3 divisor;
    divisor.val[0] = _mm_set1_epi16((npy_int16)m);           // broadcast multiplier
    divisor.val[1] = _mm_setr_epi32(sh, 0, 0, 0);            // shift count
    divisor.val[2] = _mm_set1_epi32(d < 0 ? -1 : 0);         // sign of divisor
    return divisor;
}

NPY_FINLINE npyv_s16 npyv_divc_s16(npyv_s16 a, const npyv_s16x3 divisor)
{
    __m128i t1 = _mm_mulhi_epi16(a, divisor.val[0]);   // multiply high signed words
    __m128i t2 = _mm_add_epi16(t1, a);                 // + a
    __m128i t3 = _mm_sra_epi16(t2, divisor.val[1]);    // shift right arithmetic
    __m128i t4 = _mm_srai_epi16(a, 15);                // sign of a
    __m128i t5 = _mm_sub_epi16(t4, divisor.val[2]);    // sign of a - sign of d
    __m128i t6 = _mm_sub_epi16(t3, t5);                // + 1 if a < 0, -1 if d < 0
    return _mm_xor_si128(t6, divisor.val[2]);          // change sign if divisor negative
}

Its almost the same code of VLC but pretty fit to NPYV. I can help you to with the VSX & NEON if you're not familiar with it.
take a look at #17790 to see how we implments new intrinsics, for the runtime dispatcher see #17985 or #17587(minmal one).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the detailed info @seiko2plus , I shall start work on this 👍

@ganesh-k13
Copy link
Member Author

Interesting that the speed improvement is much smaller than for the signed versions

What must be noted is the cause of speedup in signed versions:

  1. Added libdivide for const loops.
  2. Removal of % operator.

Now each change contributed to a near equal amount of speedup. But in case of unsigned, there is no % to begin with.

@ganesh-k13
Copy link
Member Author

Using SIMD in #18178

@ganesh-k13 ganesh-k13 closed this Feb 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants