-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: libdivide for unsigned integers #18055
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -926,7 +931,7 @@ def test_log_values(self): | |||
assert_raises(FloatingPointError, np.log, np.float32(-np.inf)) | |||
assert_raises(FloatingPointError, np.log, np.float32(-1.0)) | |||
|
|||
# See https://github.com/numpy/numpy/issues/18005 | |||
# See https://github.com/numpy/numpy/issues/18005 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My .vimrc autocorrected this, can undo and rebase if required, please let me know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't bother, if it isn't your editor, the next persons editor will jus do it.. We could try to do this in a dedicated STY commit, but honest, I don't think we are good about that for small stuff, and I am not sure it is worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks good to me, simple repetition of the pattern for signed ints, thanks.
Just some small comments/question and I am slightly worried about long long being larger than 64bits on some platforms. I am not aware that exists, but maybe we should just put a #error
in anyway, to be sure?
@@ -926,7 +931,7 @@ def test_log_values(self): | |||
assert_raises(FloatingPointError, np.log, np.float32(-np.inf)) | |||
assert_raises(FloatingPointError, np.log, np.float32(-1.0)) | |||
|
|||
# See https://github.com/numpy/numpy/issues/18005 | |||
# See https://github.com/numpy/numpy/issues/18005 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't bother, if it isn't your editor, the next persons editor will jus do it.. We could try to do this in a dedicated STY commit, but honest, I don't think we are good about that for small stuff, and I am not sure it is worth it.
max_value = 10**7 | ||
min_value = -10**7 | ||
max_value = 10**2 | ||
min_value = -10**2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come this change now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed in my previous PR, that small values itself show the changes properly, so I reduced it to make it memory and time friendly. I can make it 7 also, please let me know
* #type = npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong# | ||
* #c = u,u,u,ul,ull# | ||
* #TYPE = UBYTE, USHORT, UINT, ULONG, ULONGLONG# | ||
* #TYPE2 = BYTE, SHORT, INT, LONG, LONGLONG# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might name it SIGNED_TYPE? just a thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks yeah, really did not know what to call it :)
#define libdivide_@type@_t libdivide_u64_t | ||
#define libdivide_@type@_gen libdivide_u64_gen | ||
#define libdivide_@type@_do libdivide_u64_do | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realized, if there is ever hardware with long long > 64bit we are going to have a problem. If there is hardware like that, we have to fix this in the 1.20 release. If we are not worried about such existence, we still should probably anticipate it, or at least put a #error "Uhoh!"
in here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, thanks. I'll see if I can test it.
Interesting that the speed improvement is much smaller than for the signed versions, makes me almost wonder if libdivide even does much, or it is mostly the loop specialization for |
const @type@ in1 = *(@type@ *)ip1; | ||
BINARY_DEFS | ||
|
||
/* When the divisor is a constant, use libdivide for faster division */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think libdivide can beat modern compilers, have you tried to specialize scalar divisor if (steps[1] == 0)
under optimization level 3 without libdivide?
Have you tried to test it on another architectures too? have you tried to unroll the loop by four scalars per each iteration?
NOTE: Arm & IBM/Power have decent instruction-sets for scalar divison even better than the x86 ones idiv, div
.
If you still think libdivide
can perform better than native hardware support then maybe you should try Agner Fog implementation, it has a decent emulated SSE/AVX intrinsics show better performance than x86 scalar instruction-set when the divisor is scalar. see vectori128.h, vectori256.h, and vectori512.h.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I think we tried for the signed division, but it may be we forgot to ensure optimization level 3 is used. We should do that. Otherwise, I would be fine with doing this as a baseline for better approaches, which I am sure exist and would be nice to have. But if you think this is just churn, we can also just not do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the info @seiko2plus , I agree that SSE and AVX will give a performance boost that is better than raw libdivide. What I was hoping to do is that we can get a baseline version as Sebastian mentioned above, and then expose the intrinsics later. ref: #17925. I'll try optimisations of constant loops without libdivide yeah.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked libdivide docs & source code and it seems they never test it on Power, my main concern here is ppc64 since it has direct native support for unsigned and signed 64-bit & 32-bit division. also, Agner Fog didn't see a need to emulate SSE/AVX intrinsics for the 64-bit division, he only supports 8-bit, 16-bit, and 32-bit.
we can get a baseline version as Sebastian mentioned above
for what reason? almost all machines nowadays have SIMD support. I think we can just bring Agner Fog integer division intrinsics to NPYV, and removes libdivide. what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @seiko2plus , will be happy to try out Agner Fog's VCL for unsigned here and then port to signed if it's good. I noticed all the implementations are in CPP, apologies If it's a basic question: Is it possible to use a CPP lib here? Or do we build some sort of wrapper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use a CPP lib here? Or do we build some sort of wrapper?
neither both, we gonna have re-implement them for example on SSE
you can implements 16-bit division as following:
// encapsulate parameters for fast division on vector of 8 16-bit signed integers
NPY_INLINE npyv_s16x3 npyv_divisor_s16(npy_int16 a)
{
const int d1 = abs(d);
int sh, m;
if (d1 > 1) {
// TODO: implment npy_bit_scan_reverse_u32
sh = (int)npy_bit_scan_reverse_u32(d1-1); // shift count = ceil(log2(d1))-1 = (bit_scan_reverse(d1-1)+1)-1
m = ((1 << (16 + sh)) / d1 - ((1 << 16) - 1)); // calculate multiplier
}
else {
m = 1; // for d1 = 1
sh = 0;
if (d == 0) {
m /= d; // provoke error here if d = 0
}
if (d == 0x8000) { // fix overflow for this special case
m = 0x8001;
sh = 14;
}
}
npyv_s16x3 divisor;
divisor.val[0] = _mm_set1_epi16((npy_int16)m); // broadcast multiplier
divisor.val[1] = _mm_setr_epi32(sh, 0, 0, 0); // shift count
divisor.val[2] = _mm_set1_epi32(d < 0 ? -1 : 0); // sign of divisor
return divisor;
}
NPY_FINLINE npyv_s16 npyv_divc_s16(npyv_s16 a, const npyv_s16x3 divisor)
{
__m128i t1 = _mm_mulhi_epi16(a, divisor.val[0]); // multiply high signed words
__m128i t2 = _mm_add_epi16(t1, a); // + a
__m128i t3 = _mm_sra_epi16(t2, divisor.val[1]); // shift right arithmetic
__m128i t4 = _mm_srai_epi16(a, 15); // sign of a
__m128i t5 = _mm_sub_epi16(t4, divisor.val[2]); // sign of a - sign of d
__m128i t6 = _mm_sub_epi16(t3, t5); // + 1 if a < 0, -1 if d < 0
return _mm_xor_si128(t6, divisor.val[2]); // change sign if divisor negative
}
Its almost the same code of VLC but pretty fit to NPYV. I can help you to with the VSX & NEON if you're not familiar with it.
take a look at #17790 to see how we implments new intrinsics, for the runtime dispatcher see #17985 or #17587(minmal one).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the detailed info @seiko2plus , I shall start work on this 👍
What must be noted is the cause of speedup in signed versions:
Now each change contributed to a near equal amount of speedup. But in case of unsigned, there is no |
Using SIMD in #18178 |
Changes:
Benchmarks:
Benchmarks:
Note: I saw unsigned ints was missed when I was working on scalar fastpaths, so added it here.
Continues: #17727