
ENH,BENCH: Optimize floor_divide for VSX4/Power10 #20976


Merged: 4 commits merged into numpy:main from p10_enh_intdiv on Feb 10, 2022

Conversation

rafaelcfsousa (Contributor):

This PR optimizes floor_divide for VSX4/Power10.

Some comments:

  • I used raw VSX intrinsics since the universal intrinsics API only supports vector division by a scalar (a hedged sketch of such a helper follows this list)
  • I did not include ulonglong since that data type does not benefit from this contribution
  • The compiler was not able to automatically vectorize the arr // arr code for the unsigned types
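
Below is a minimal, hedged sketch of what such a raw-intrinsic helper can look like; the names (v_u32, vsx4_div_u32_sketch) are illustrative rather than the PR's actual code, and it assumes a compiler targeting ISA 3.1 (e.g. GCC/Clang with -mcpu=power10), where vec_div on integer vectors maps to the new vector divide instructions. Division by zero still has to be detected separately, which is what the loops in this PR do with vec_any_eq(b, zero).

#include <altivec.h>

typedef __vector unsigned int v_u32;

/* Element-wise unsigned 32-bit division with the raw VSX intrinsic vec_div
 * (vdivuw on Power10). The universal intrinsics API only exposes division
 * by a precomputed scalar divisor (npyv_divc_*), hence the raw intrinsic. */
static inline v_u32 vsx4_div_u32_sketch(v_u32 a, v_u32 b)
{
    return vec_div(a, b);
}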

Some benchmark results are shown below:

       before           after         ratio
     [6077afd6]       [4dc613ef]
     <main>           <p10_enh_intdiv>
-      60.1±0.1μs       57.2±0.1μs     0.95  ArrayDivInt(int64, 10000)
-     18.2±0.04μs       17.3±0.1μs     0.95  ArrayDivInt(uint8, 10000)
-     5.91±0.01ms      5.58±0.01ms     0.94  ArrayDivInt(int32, 1000000)
-        1.63±0ms      1.52±0.01ms     0.94  ArrayDivInt(uint8, 1000000)
-        6.96±0ms      6.48±0.01ms     0.93  ArrayDivInt(int8, 1000000)
-      60.6±0.4μs       56.3±0.5μs     0.93  ArrayDivInt(int32, 10000)
-      69.5±0.3μs       64.0±0.2μs     0.92  ArrayDivInt(int8, 10000)
-     24.3±0.02μs      21.2±0.03μs     0.87  ArrayDivInt(uint32, 10000)
-        2.25±0ms         1.94±0ms     0.86  ArrayDivInt(uint32, 1000000)
-      61.6±0.2μs       36.3±0.5μs     0.59  ScalarDivInt(int64, -43)
-      61.6±0.2μs       36.2±0.5μs     0.59  ScalarDivInt(int64, 43)
-      60.9±0.1μs       22.4±0.4μs     0.37  ScalarDivInt(int64, -8)
-      60.9±0.2μs       22.2±0.2μs     0.36  ScalarDivInt(int64, 8)

mattip (Member) commented on Feb 2, 2022:

Closes #20849

Comment on lines 154 to 163
for (; len >= vstep; len -= vstep, src1 += vstep, src2 += vstep,
                     dst += vstep) {
    npyv_@sfx@ a = npyv_load_@sfx@(src1);
    npyv_@sfx@ b = npyv_load_@sfx@(src2);
    npyv_@sfx@ c = vsx4_div_@sfx@(a, b);
    npyv_store_@sfx@(dst, c);
    if (vec_any_eq(b, zero)) {
        npy_set_floatstatus_divbyzero();
    }
}
seiko2plus (Member) commented on Feb 6, 2022:
Suggested change (replacing the lines quoted above):
npyv_@bsfx@ is_bzero = npyv_cvt_@bsfx@_@sfx@(zero);
for (; len >= vstep; len -= vstep, src1 += vstep, src2 += vstep,
                     dst += vstep) {
    npyv_@sfx@ a = npyv_load_@sfx@(src1);
    npyv_@sfx@ b = npyv_load_@sfx@(src2);
    npyv_@sfx@ c = vsx4_div_@sfx@(a, b);
    is_bzero = npyv_or_@bsfx@(is_bzero, npyv_cmpeq_@sfx@(b, zero));
    npyv_store_@sfx@(dst, c);
}
/* raise divide-by-zero once, after the loop, if any divisor lane was zero */
if (!vec_all_eq(npyv_cvt_@sfx@_@bsfx@(is_bzero), zero)) {
    npy_set_floatstatus_divbyzero();
}

Reducing jumps should increase performance. Also, cmpeq plus or should be faster than vec_any_eq, shouldn't it? If not, then wrap vec_any_eq() in NPY_UNLIKELY().

rafaelcfsousa (Contributor, author) replied:

Hi @seiko2plus, thanks for the review!

I tried removing the control flow following your suggestion (and also tried some other mechanisms), but none of them improved the execution time; for some cases, I saw a slowdown of ~5-7%. Since it didn't improve things, I only added NPY_UNLIKELY(), as sketched below.
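
For clarity, a minimal sketch of that variant is shown below. It keeps the per-iteration check but wraps it in NumPy's NPY_UNLIKELY macro (a __builtin_expect wrapper) so the compiler can move the error path off the hot path; the surrounding loop, b, and the zero vector are assumed to be the same as in the code quoted above.

/* inside the vectorized loop body, after loading b */
if (NPY_UNLIKELY(vec_any_eq(b, zero))) {
    npy_set_floatstatus_divbyzero();
}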

@@ -46,9 +46,6 @@
* - For 64-bit division on Aarch64 and IBM/Power, we fall-back to the scalar division
* since emulating multiply-high is expensive and both architectures have very fast dividers.
*
** TODO:
* - Add support for Power10(VSX4)
seiko2plus (Member) commented:
Adding support for VSX4 should also include using the intrinsic vec_mulh instead of vec_mule, vec_mulo, and vec_perm within the npyv_divc_##sfx intrinsics, see:

npyv_u16 mul_even = vec_mule(a, divisor.val[0]);
npyv_u16 mul_odd = vec_mulo(a, divisor.val[0]);
npyv_u8 mulhi = (npyv_u8)vec_perm(mul_even, mul_odd, mergeo_perm);

It should also use vec_div() within npyv_divc_u64:

NPY_FINLINE npyv_u64 npyv_divc_u64(npyv_u64 a, const npyv_u64x3 divisor)
{
    const npy_uint64 d = vec_extract(divisor.val[0], 0);
    return npyv_set_u64(vec_extract(a, 0) / d, vec_extract(a, 1) / d);
}
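
To make the suggestion concrete, here is a hedged, Power10-only sketch; the helper names are illustrative and not NumPy's actual npyv_divc_* implementation, and both intrinsics assume an ISA 3.1 target (e.g. GCC/Clang with -mcpu=power10). It shows the two replacements being proposed: vec_mulh to obtain the high half of each 32-bit multiply directly, and vec_div to divide 64-bit lanes without the scalar extract/divide fallback quoted above.

#include <altivec.h>

typedef __vector unsigned int       v_u32;
typedef __vector unsigned long long v_u64;

/* High 32 bits of each 32x32->64 product (vmulhuw on Power10), replacing
 * the vec_mule/vec_mulo/vec_perm sequence shown for the 8-bit case above. */
static inline v_u32 mulhi_u32_sketch(v_u32 a, v_u32 multiplier)
{
    return vec_mulh(a, multiplier);
}

/* Element-wise 64-bit unsigned division (vdivud on Power10), replacing
 * the scalar extract/divide/rebuild fallback of npyv_divc_u64 above. */
static inline v_u64 div_u64_sketch(v_u64 a, v_u64 b)
{
    return vec_div(a, b);
}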

rafaelcfsousa (Contributor, author) replied:

After modifying the universal intrinsic divc following your suggestion, I see speedups between ~5-10%.

Note that I only modified the u32, s32, and u64 data types. For the other data types, [su]8 and [su]16, POWER ISA 3.1 does not provide vmulh for byte and half-word, so the change would require extra instructions (e.g., merge, pack) that lead to a slowdown of ~20-25%. Because of that, I left them unchanged.

@seiko2plus added the "component: SIMD" label (Issues in SIMD (fast instruction sets) code or machinery) on Feb 6, 2022.
seiko2plus (Member) left a review comment:

LGTM, Thank you!

@mattip merged commit b97e7d5 into numpy:main on Feb 10, 2022
mattip (Member) commented on Feb 10, 2022:

Thanks @rafaelcfsousa

Labels: 01 - Enhancement; component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)

Successfully merging this pull request may close these issues:

DOC: reminder to add a release note about Power10 VSX4 support before 1.23