
ENH,BENCH: Optimize floor_divide for VSX4/Power10 #20976


Merged: 4 commits merged into numpy:main from p10_enh_intdiv on Feb 10, 2022

Conversation

rafaelcfsousa (Contributor):

This PR optimizes floor_divide for VSX4/Power10.

Some comments:

  • I used raw VSX intrinsics since the universal intrinsics API only supports vector division by a scalar (a hedged sketch of such a helper follows this list)
  • I did not include ulonglong since that data type does not benefit from this contribution
  • The compiler was not able to automatically vectorize the arr // arr code for the unsigned types
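
Below is a minimal, hedged sketch of what such a raw-intrinsic helper can look like; the names (v_u32, vsx4_div_u32_sketch) are illustrative rather than the PR's actual code, and it assumes a compiler targeting ISA 3.1 (e.g. GCC/Clang with -mcpu=power10), where vec_div on integer vectors maps to the new vector divide instructions. Division by zero still has to be detected separately, which is what the loops in this PR do with vec_any_eq(b, zero).

#include <altivec.h>

typedef __vector unsigned int v_u32;

/* Element-wise unsigned 32-bit division with the raw VSX intrinsic vec_div
 * (vdivuw on Power10). The universal intrinsics API only exposes division
 * by a precomputed scalar divisor (npyv_divc_*), hence the raw intrinsic. */
static inline v_u32 vsx4_div_u32_sketch(v_u32 a, v_u32 b)
{
    return vec_div(a, b);
}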

Some benchmark results are shown below:

       before           after         ratio
     [6077afd6]       [4dc613ef]
     <main>           <p10_enh_intdiv>
-      60.1±0.1μs       57.2±0.1μs     0.95  ArrayDivInt(int64, 10000)
-     18.2±0.04μs       17.3±0.1μs     0.95  ArrayDivInt(uint8, 10000)
-     5.91±0.01ms      5.58±0.01ms     0.94  ArrayDivInt(int32, 1000000)
-        1.63±0ms      1.52±0.01ms     0.94  ArrayDivInt(uint8, 1000000)
-        6.96±0ms      6.48±0.01ms     0.93  ArrayDivInt(int8, 1000000)
-      60.6±0.4μs       56.3±0.5μs     0.93  ArrayDivInt(int32, 10000)
-      69.5±0.3μs       64.0±0.2μs     0.92  ArrayDivInt(int8, 10000)
-     24.3±0.02μs      21.2±0.03μs     0.87  ArrayDivInt(uint32, 10000)
-        2.25±0ms         1.94±0ms     0.86  ArrayDivInt(uint32, 1000000)
-      61.6±0.2μs       36.3±0.5μs     0.59  ScalarDivInt(int64, -43)
-      61.6±0.2μs       36.2±0.5μs     0.59  ScalarDivInt(int64, 43)
-      60.9±0.1μs       22.4±0.4μs     0.37  ScalarDivInt(int64, -8)
-      60.9±0.2μs       22.2±0.2μs     0.36  ScalarDivInt(int64, 8)

mattip (Member) commented on Feb 2, 2022:

Closes #20849

Comment on lines 154 to 163
for (; len >= vstep; len -= vstep, src1 += vstep, src2 += vstep,
                     dst += vstep) {
    npyv_@sfx@ a = npyv_load_@sfx@(src1);
    npyv_@sfx@ b = npyv_load_@sfx@(src2);
    npyv_@sfx@ c = vsx4_div_@sfx@(a, b);
    npyv_store_@sfx@(dst, c);
    if (vec_any_eq(b, zero)) {
        npy_set_floatstatus_divbyzero();
    }
}
seiko2plus (Member) commented on Feb 6, 2022:
Suggested change (replacing the lines quoted above):
npyv_@bsfx@ is_bzero = npyv_cvt_@bsfx@_@sfx@(zero);
for (; len >= vstep; len -= vstep, src1 += vstep, src2 += vstep,
                     dst += vstep) {
    npyv_@sfx@ a = npyv_load_@sfx@(src1);
    npyv_@sfx@ b = npyv_load_@sfx@(src2);
    npyv_@sfx@ c = vsx4_div_@sfx@(a, b);
    is_bzero = npyv_or_@bsfx@(is_bzero, npyv_cmpeq_@sfx@(b, zero));
    npyv_store_@sfx@(dst, c);
}
/* raise divide-by-zero once, after the loop, if any divisor lane was zero */
if (!vec_all_eq(npyv_cvt_@sfx@_@bsfx@(is_bzero), zero)) {
    npy_set_floatstatus_divbyzero();
}

Reducing jumps should increase performance. Also, cmpeq plus or should be faster than vec_any_eq, shouldn't it? If not, then wrap vec_any_eq() in NPY_UNLIKELY().

rafaelcfsousa (Contributor, author) replied:

Hi @seiko2plus, thanks for the review!

I tried removing the control flow following your suggestion (and also tried some other mechanisms), but none of them improved the execution time; for some cases, I saw a slowdown of ~5-7%. Since it didn't improve things, I only added NPY_UNLIKELY(), as sketched below.
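
For clarity, a minimal sketch of that variant is shown below. It keeps the per-iteration check but wraps it in NumPy's NPY_UNLIKELY macro (a __builtin_expect wrapper) so the compiler can move the error path off the hot path; the surrounding loop, b, and the zero vector are assumed to be the same as in the code quoted above.

/* inside the vectorized loop body, after loading b */
if (NPY_UNLIKELY(vec_any_eq(b, zero))) {
    npy_set_floatstatus_divbyzero();
}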

@@ -46,9 +46,6 @@
* - For 64-bit division on Aarch64 and IBM/Power, we fall-back to the scalar division
* since emulating multiply-high is expensive and both architectures have very fast dividers.
*
** TODO:
* - Add support for Power10(VSX4)
seiko2plus (Member) commented:
Adding support for VSX4 should also include using the intrinsic vec_mulh instead of vec_mule, vec_mulo, and vec_perm within the npyv_divc_##sfx intrinsics, see:

npyv_u16 mul_even = vec_mule(a, divisor.val[0]);
npyv_u16 mul_odd = vec_mulo(a, divisor.val[0]);
npyv_u8 mulhi = (npyv_u8)vec_perm(mul_even, mul_odd, mergeo_perm);

It should also use vec_div() within npyv_divc_u64:

NPY_FINLINE npyv_u64 npyv_divc_u64(npyv_u64 a, const npyv_u64x3 divisor)
{
    const npy_uint64 d = vec_extract(divisor.val[0], 0);
    return npyv_set_u64(vec_extract(a, 0) / d, vec_extract(a, 1) / d);
}
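
To make the suggestion concrete, here is a hedged, Power10-only sketch; the helper names are illustrative and not NumPy's actual npyv_divc_* implementation, and both intrinsics assume an ISA 3.1 target (e.g. GCC/Clang with -mcpu=power10). It shows the two replacements being proposed: vec_mulh to obtain the high half of each 32-bit multiply directly, and vec_div to divide 64-bit lanes without the scalar extract/divide fallback quoted above.

#include <altivec.h>

typedef __vector unsigned int       v_u32;
typedef __vector unsigned long long v_u64;

/* High 32 bits of each 32x32->64 product (vmulhuw on Power10), replacing
 * the vec_mule/vec_mulo/vec_perm sequence shown for the 8-bit case above. */
static inline v_u32 mulhi_u32_sketch(v_u32 a, v_u32 multiplier)
{
    return vec_mulh(a, multiplier);
}

/* Element-wise 64-bit unsigned division (vdivud on Power10), replacing
 * the scalar extract/divide/rebuild fallback of npyv_divc_u64 above. */
static inline v_u64 div_u64_sketch(v_u64 a, v_u64 b)
{
    return vec_div(a, b);
}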

rafaelcfsousa (Contributor, author) replied:

After modifying the universal intrinsic divc following your suggestion, I see speedups between ~5-10%.

Note that I only modified the u32, s32, and u64 data types. For the other data types, [su]8 and [su]16, POWER ISA 3.1 does not provide vmulh for byte and half-word, so the change would require extra instructions (e.g., merge, pack) that lead to a slowdown of ~20-25%. Because of that, I left them unchanged.

@seiko2plus added the "component: SIMD" label (Issues in SIMD (fast instruction sets) code or machinery) on Feb 6, 2022.
seiko2plus (Member) left a review comment:

LGTM, Thank you!

@mattip merged commit b97e7d5 into numpy:main on Feb 10, 2022
mattip (Member) commented on Feb 10, 2022:

Thanks @rafaelcfsousa

Labels: 01 - Enhancement; component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)

Successfully merging this pull request may close these issues:

DOC: reminder to add a release note about Power10 VSX4 support before 1.23