-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
SIMD: add fast integer division intrinsics for all supported platforms #18178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1cee85a
to
0f87688
Compare
af0e706
to
b99768b
Compare
I think this should have an attribution, something like
@rgommers is that correct? |
No that sounds wrong. Reminder, we don't do Apache2, see for example: #13447 (comment). We could change that at some point, but it's a significant change and needs a strong motivation and decision on the mailing list. VCL is a little hard to find, so here is a link: https://github.com/vectorclass/version2. If this PR took code from there, that seems like a blocker for accepting it. |
Now read the PR description, the whole PR is based on that, and it seems valuable. It may be worth considering, especially if there are no good alternatives. We'd be deciding to give up on GPLv2 compatibility. |
@rgommers, we can ask for re-licensing, since we're not taking the same exact code plus the original code itself is based on T. Granlund and P. L. Montgomery work. |
Most of the code was rewritten, while some of the code is very similar to Apache 2.0 based VSL, which is not compatible with current 3-clause BSD License(Apache 2.0 is more restrictive), so we should either rewrite the similar code or ask Agner Fog modify the License. |
b99768b
to
d98d927
Compare
9a6bd67
to
c4b67b7
Compare
261a56c
to
d4fa7dd
Compare
d4fa7dd
to
41a7cdd
Compare
41a7cdd
to
6d9d969
Compare
@ganesh-k13, I added intrinsics for the 64-bit division so we can drop libdiv. |
cd96539
to
0a3ea2a
Compare
0a3ea2a
to
48690a0
Compare
I added a missing case when the divisor equals |
It seems there is not a test for divisor == 0? Or maybe that case is not needed, so maybe add a comment where coverage is complaining that the code path should never be reached |
c21c61b
to
5d2c57f
Compare
5d2c57f
to
8aae310
Compare
@mattip, calling |
Thanks @seiko2plus |
Hey @seiko2plus , I was trying to add dispatch for signed and in the case of signed 64 bits, the "round down" is not happening. Here is the code used to test it: diff. s8, s16, and s32 seem to work fine. Just confirming from bt:
The last |
@ganesh-k13, this patch follows the C compilers behavior for the signed division which mean: Here an example in Python for implementing floor division on the top truncate division: """
the method that been used in this implementation is based on T. Granlund and P. L. Montgomery
−dsign = SRL(d, N − 1);
−nsign = (n < −dsign );
/* SLT, signed */
−qsign = EOR(−nsign , −dsign );
q = TRUNC((n − (−dsign ) + (−nsign ))/d) − (−qsign );
"""
from numpy.core._simd import baseline as v
for d in [-100, -1, -50, 100, 1, 100]:
nsign_d = v.setall_s16(int(d < 0))
divisor = v.divisor_s16(d)
noverflow = v.cvt_b16_s16(v.setall_s16(-1))
for n in [-0x8000, -255, 255, -100, 1, 100]:
a = v.load_s16(range(n, n + v.nlanes_s16))
nsign_a = v.cvt_s16_b16(v.cmplt_s16(a, nsign_d))
nsign_a = v.and_s16(nsign_a, v.setall_s16(1))
diff_sign = v.sub_s16(nsign_a, nsign_d)
to_ninf = v.xor_s16(nsign_a, nsign_d)
trunc = v.divc_s16(v.add_s16(a, diff_sign), divisor)
if d == -1:
greater_min = v.cmpgt_s16(a, v.setall_s16(-0x8000))
noverflow = v.and_b16(noverflow, greater_min)
floor = v.ifsub_s16(greater_min, trunc, to_ninf, v.zero_s16())
else:
floor = v.sub_s16(trunc, to_ninf)
assert floor == [x//d for x in a]
if v.tobits_b16(noverflow) != (1 << v.nlanes_s16)-1:
0//0 |
Thanks a lot for the info @seiko2plus. As python prefers round to -inf, shall I replace/add the npyv_divc_s* with round to -inf as shown in paper? Will be happy to raise a PR with round to -inf. [EDIT] Or do we build on top of trunc as shown above? [EDIT 2] My main question is, do we edit npyv_divc_s* or add code in the dispatch to modify the divisor. |
@ganesh-k13, just handle it directly within the dispatch-able source file, note we will still need to use truncate division for |
Add fast integer division intrinsics for all supported platforms
merge before #18075
Almost all architecture (except Power10) doesn't support integer vector division,
also the cost of scalar division in architectures like x86 is too high it can take
30 to 40 cycles on modern chips and up to 100 on old ones.
Therefore we are using division by multiplying with precomputed reciprocal technique,
the method that been used in this implementation is based on T. Granlund and P. L. Montgomery
“Division by invariant integers using multiplication(see [Figure 4.1, 5.1]
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556)
It shows a good impact for all architectures especially on X86,
however computing divisor parameters is kind of expensive so this implementation
should only works when divisor is a scalar and used multiple of times.
The division process is separated into two intrinsics for each data type
npyv_{dtype}x3 npyv_divisor_{dtype} ({dtype} divisor);
For computing the divisor parameters (multiplier + shifters + sign of divisor(signed only))
npyv_{dtype} npyv_divisor_{dtype} (npyv_{dtype} dividend, npyv_{dtype}x3 divisor_parms);
For performing the final division.
For example:
NOTES:
since emulating multiply-high is expensive and both architectures have very fast dividers.
TODO: