-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
ENH,BENCH: Optimize floor_divide for VSX4/Power10 #20976
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Closes #20849 |
for (; len >= vstep; len -= vstep, src1 += vstep, src2 += vstep, | ||
dst += vstep) { | ||
npyv_@sfx@ a = npyv_load_@sfx@(src1); | ||
npyv_@sfx@ b = npyv_load_@sfx@(src2); | ||
npyv_@sfx@ c = vsx4_div_@sfx@(a, b); | ||
npyv_store_@sfx@(dst, c); | ||
if (vec_any_eq(b, zero)) { | ||
npy_set_floatstatus_divbyzero(); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (; len >= vstep; len -= vstep, src1 += vstep, src2 += vstep, | |
dst += vstep) { | |
npyv_@sfx@ a = npyv_load_@sfx@(src1); | |
npyv_@sfx@ b = npyv_load_@sfx@(src2); | |
npyv_@sfx@ c = vsx4_div_@sfx@(a, b); | |
npyv_store_@sfx@(dst, c); | |
if (vec_any_eq(b, zero)) { | |
npy_set_floatstatus_divbyzero(); | |
} | |
} | |
npyv_@bsfx@ is_bzero = npyv_cvt_@bsfx@_@sfx@(zero); | |
for (; len >= vstep; len -= vstep, src1 += vstep, src2 += vstep, | |
dst += vstep) { | |
npyv_@sfx@ a = npyv_load_@sfx@(src1); | |
npyv_@sfx@ b = npyv_load_@sfx@(src2); | |
npyv_@sfx@ c = vsx4_div_@sfx@(a, b); | |
is_bzero = npyv_or_@bsfx@(is_bzero, npyv_cmpeq_@sfx@(b, zero)); | |
npyv_store_@sfx@(dst, c); | |
} | |
if (!vec_all_eq(b, zero)) { | |
npy_set_floatstatus_divbyzero(); | |
} |
reduce jmps should increase performance. also equal
and or
should be faster vec_any
isn't? if not then wrap vec_any_eq()
by NPY_UNLIKELY()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @seiko2plus, thanks for the review!
I tried to remove the control flow following your suggestion (and also using some other mechanisms), but neither one improved the execution time. For some cases, I see a slowdown between ~5-7%. As it didn't improve, I only added NPY_UNLIKELY()
.
@@ -46,9 +46,6 @@ | |||
* - For 64-bit division on Aarch64 and IBM/Power, we fall-back to the scalar division | |||
* since emulating multiply-high is expensive and both architectures have very fast dividers. | |||
* | |||
** TODO: | |||
* - Add support for Power10(VSX4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add support for VSX4 should include using intrinsic vec_mulh
instead of mulo, mule, and permute
within intrinsics npyv_divc_##sfx
see:
numpy/numpy/core/src/common/simd/vsx/arithmetic.h
Lines 108 to 110 in 0e64536
npyv_u16 mul_even = vec_mule(a, divisor.val[0]); | |
npyv_u16 mul_odd = vec_mulo(a, divisor.val[0]); | |
npyv_u8 mulhi = (npyv_u8)vec_perm(mul_even, mul_odd, mergeo_perm); |
also uses vec_div()
within npyv_divc_u64
numpy/numpy/core/src/common/simd/vsx/arithmetic.h
Lines 214 to 218 in 0e64536
NPY_FINLINE npyv_u64 npyv_divc_u64(npyv_u64 a, const npyv_u64x3 divisor) | |
{ | |
const npy_uint64 d = vec_extract(divisor.val[0], 0); | |
return npyv_set_u64(vec_extract(a, 0) / d, vec_extract(a, 1) / d); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After modifying the universal intrinsic divc
following your suggestion, I see speedups between ~5-10%.
You will notice that I only modified the datatypes u32, s32, and u64. For the other data types, [su]8 and [su]16, as the POWER ISA 3.1 does not have vmulh
for byte and half-word, the changes would require extra instructions (e.g., merge, pack, etc) that would lead to a slowdown of ~20-25%. Because of that, I left them without changes.
0e64536
to
eed9c36
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thank you!
Thanks @rafaelcfsousa |
This PR:
Some comments:
See below some benchmarking information: