
ENH,SIMD: Vectorize modulo/divide using the universal intrinsics (VSX4) #21124


Merged: 1 commit into numpy:main on Apr 27, 2022

Conversation

rafaelcfsousa
Contributor

This PR optimizes the following NumPy operations for VSX4/Power10: (1) divmod, (2) floor_divide, (3) fmod, and (4) remainder.

In summary, the optimization vectorizes the operations listed above using the new integer vector modulo/division instructions available in ISA 3.1 (Power10/VSX4).
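
To make the idea concrete, here is a minimal sketch of what such an inner loop looks like when written directly against the Power10 built-ins. This is illustrative only (the actual PR code goes through NumPy's universal intrinsics and handles division by zero, strides, and tails properly); the helper name and the assumed `vec_mod`/`vec_xl` built-ins and `-mcpu=power10` compile flag are illustrative assumptions, not the merged code:

```c
/*
 * Illustrative sketch only -- not this PR's implementation.
 * Assumes the <altivec.h> built-ins from GCC/Clang with -mcpu=power10;
 * vec_mod is expected to map to the new ISA 3.1 vmodsw instruction.
 * Division by zero, overflow, strides, and the floor/sign fix-up that
 * numpy.remainder/divmod require are all omitted here.
 */
#include <altivec.h>

static void fmod_s32_sketch(const int *a, const int *b, int *out, int len)
{
    int i = 0;
    for (; i + 4 <= len; i += 4) {              /* 4 x int32 per 128-bit vector */
        vector signed int va = vec_xl(0, a + i);
        vector signed int vb = vec_xl(0, b + i);
        vec_xst(vec_mod(va, vb), 0, out + i);
    }
    for (; i < len; ++i) {                      /* scalar tail */
        out[i] = a[i] % b[i];
    }
}
```

Note that numpy.remainder and divmod need an additional floor/sign adjustment on top of the truncated C-style remainder shown here.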

You will notice that I moved the operations listed above (except floor_divide) from loops.c.src to loops_arithmetic.dispatch.c.src, for the following reasons:
(1) to reuse the code of a recent PR (#20976)
(2) to use the CPU feature detection (VSX4)
(3) to use the universal intrinsics

See below the results of benchmarks I ran on a Power10 machine:

numpy.fmod

  • arr OP arr
    • [signed ] speedup of up to 1.17x
    • [unsigned] speedup of up to 1.13x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.01 | 1.03 | 1.06  | 1.07   | 1.07    |
| int16  | 1.04 | 1.04 | 1.10  | 1.12   | 1.12    |
| int32  | 1.03 | 1.04 | 1.11  | 1.13   | 1.13    |
| int64  | 1.02 | 0.98 | 0.96  | 0.96   | 0.96    |
| uint8  | 1.00 | 1.01 | 1.08  | 1.10   | 1.09    |
| uint16 | 1.02 | 1.07 | 1.13  | 1.15   | 1.15    |
| uint32 | 1.01 | 1.07 | 1.15  | 1.16   | 1.17    |
| uint64 | 0.99 | 0.99 | 0.98  | 0.98   | 0.98    |
  • arr OP scalar
    • [signed ] speedup of up to 1.34x
    • [unsigned] speedup of up to 1.29x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.06 | 1.09 | 1.24  | 1.27   | 1.29    |
| int16  | 1.06 | 1.12 | 1.26  | 1.31   | 1.34    |
| int32  | 1.06 | 1.09 | 1.23  | 1.28   | 1.28    |
| int64  | 1.02 | 1.02 | 1.03  | 1.04   | 1.04    |
| uint8  | 1.06 | 1.07 | 1.20  | 1.26   | 1.27    |
| uint16 | 1.05 | 1.08 | 1.21  | 1.28   | 1.29    |
| uint32 | 1.05 | 1.08 | 1.21  | 1.28   | 1.29    |
| uint64 | 1.04 | 1.03 | 1.03  | 1.03   | 1.03    |

numpy.remainder

  • arr OP arr
    • [signed ] speedup of up to 4.19x
    • [unsigned] speedup of up to 1.17x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.09 | 1.28 | 3.43  | 3.96   | 4.06    |
| int16  | 1.07 | 1.44 | 3.77  | 4.13   | 4.19    |
| int32  | 1.03 | 1.18 | 3.47  | 3.83   | 3.85    |
| int64  | 1.05 | 1.08 | 2.78  | 3.01   | 3.00    |
| uint8  | 0.98 | 1.02 | 1.07  | 1.09   | 1.09    |
| uint16 | 1.02 | 1.07 | 1.12  | 1.15   | 1.15    |
| uint32 | 0.99 | 1.06 | 1.15  | 1.16   | 1.17    |
| uint64 | 0.98 | 0.99 | 0.98  | 0.98   | 0.98    |
  • arr OP scalar
    • [signed ] speedup of up to 4.87x
    • [unsigned] speedup of up to 1.29x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.10 | 1.43 | 4.07  | 4.79   | 4.87    |
| int16  | 1.11 | 1.44 | 3.98  | 4.72   | 4.80    |
| int32  | 1.08 | 1.28 | 3.84  | 4.53   | 4.00    |
| int64  | 1.03 | 1.20 | 3.10  | 3.55   | 3.59    |
| uint8  | 1.04 | 1.07 | 1.20  | 1.26   | 1.26    |
| uint16 | 1.04 | 1.07 | 1.21  | 1.28   | 1.29    |
| uint32 | 1.04 | 1.07 | 1.21  | 1.27   | 1.29    |
| uint64 | 1.04 | 1.03 | 1.03  | 1.03   | 1.04    |

numpy.divmod

  • arr OP arr
    • [signed ] speedup of up to 4.73x
    • [unsigned] speedup of up to 1.23x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.05 | 1.41 | 3.70  | 4.65   | 4.73    |
| int16  | 1.04 | 1.50 | 3.64  | 4.16   | 4.24    |
| int32  | 1.03 | 1.28 | 3.42  | 3.92   | 3.92    |
| int64  | 0.95 | 0.88 | 1.60  | 1.66   | 1.66    |
| uint8  | 1.01 | 1.04 | 1.17  | 1.22   | 1.23    |
| uint16 | 1.00 | 1.05 | 1.17  | 1.20   | 1.20    |
| uint32 | 1.01 | 1.05 | 1.16  | 1.20   | 1.19    |
| uint64 | 0.98 | 0.98 | 0.96  | 0.96   | 0.96    |
  • arr OP scalar
    • [signed ] speedup of up to 5.05x
    • [unsigned] speedup of up to 1.31x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.01 | 1.45 | 3.99  | 5.05   | 5.01    |
| int16  | 0.99 | 1.47 | 3.95  | 4.93   | 4.43    |
| int32  | 1.00 | 1.25 | 3.51  | 3.99   | 3.47    |
| int64  | 0.97 | 0.97 | 1.75  | 1.79   | 1.65    |
| uint8  | 1.02 | 1.06 | 1.19  | 1.27   | 1.27    |
| uint16 | 1.01 | 1.07 | 1.22  | 1.31   | 1.27    |
| uint32 | 1.00 | 1.06 | 1.21  | 1.24   | 1.19    |
| uint64 | 0.98 | 1.01 | 1.02  | 1.02   | 1.02    |

numpy.floor_divide

  • arr OP arr
    • [signed ] speedup of up to 4.44x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.01 | 1.20 | 3.59  | 4.30   | 4.39    |
| int16  | 1.02 | 1.97 | 4.06  | 4.37   | 4.44    |
| int32  | 0.98 | 1.62 | 3.49  | 3.79   | 3.82    |
| int64  | 0.89 | 1.02 | 1.63  | 1.66   | 1.66    |

P.S.: There are 3 remaining lint errors in my code; they are in the function signatures.

@mattip added the component: SIMD (Issues in SIMD (fast instruction sets) code or machinery) label on Feb 27, 2022
@seiko2plus (Member) left a comment

The new optimizations should be moved into a separate dispatch-able source that only holds the targets "baseline vsx4", e.g. loops_modulo.dispatch.c.src, to avoid an unnecessary increase of the binary size on x86 for the unaffected targets (sse41, avx2, avx512f, avx512_skx) within loops_arithmetic.dispatch.c.src.
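
For context, a dispatch-able source declares the targets it is built for in a directive at the top of the file; restricting it to the baseline and VSX4 keeps the x86 builds untouched. A rough sketch of such a header (illustrative; the exact directives in the eventual loops_modulo.dispatch.c.src may differ):

```c
/*
 * Rough sketch of a dispatch-able source header (illustrative only).
 * Listing only "baseline vsx4" means the build system compiles this
 * file solely for the baseline and the VSX4 target, so none of the
 * x86 targets end up duplicating its code.
 */
/*@targets
 ** baseline vsx4
 **/
```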

@rafaelcfsousa
Contributor Author

> The new optimizations should be moved into a separate dispatch-able source that only holds the targets "baseline vsx4", e.g. loops_modulo.dispatch.c.src, to avoid an unnecessary increase of the binary size on x86 for the unaffected targets (sse41, avx2, avx512f, avx512_skx) within loops_arithmetic.dispatch.c.src.

Hi @seiko2plus, thank you for reviewing my code.

Below are two points I would like to discuss with you:

  1. One of the operations I optimized in this PR was floor_divide (signed integer types -- array // array). This operation shares some similarities with divmod, which allowed me to reuse the same code to implement/optimize both operations for VSX4. I think that if I move divmod to a new dispatch-able source file, I will have to duplicate a considerable part of that code there.

  2. The optimizations I applied within the dispatch-able source file loops_arithmetic.dispatch.c.src are inside an #ifdef guard (#if defined(NPY_HAVE_VSX4)), which means the compiler's preprocessor removes this code before the binaries for the other targets are generated (see the sketch below).
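
A rough sketch of the guard pattern I mean (illustrative only, not the exact code in this PR):

```c
/*
 * Illustrative only: for any dispatch target that does not define
 * NPY_HAVE_VSX4, the preprocessor drops the whole block, so the x86
 * targets compile exactly the same code they had before this PR.
 */
#if defined(NPY_HAVE_VSX4)
/* VSX4-only helpers using the new vector modulo/divide instructions */
#endif
```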

I would like to know whether, even considering the points above, I should still move the code to another dispatch-able source file. If you think I should, I will apply the suggested changes.

@charris closed this on Apr 6, 2022
@charris reopened this on Apr 6, 2022
@rafaelcfsousa
Contributor Author

Hi, I would like to request a review of my code (cc: @mattip, @seiko2plus).

For this PR specifically, I would like to know whether what I said above makes sense; if not, let me know and I will apply the requested changes.

The CI errors seem to be related to a timeout. The PRs on the first page were closed and reopened (in sequence) last week.

Thanks!

@rafaelcfsousa
Contributor Author

rafaelcfsousa commented Apr 14, 2022

Hi, I analyzed the size of the binaries generated for x86 and, as stated by @seiko2plus, this PR does increase them, since the code moved from loops.c.src is replicated for each of the SIMD extensions defined as targets within loops_arithmetic.dispatch.c.src (see below).

| object file                             | before | after |
|-----------------------------------------|--------|-------|
| loops_arithmetic.dispatch.avx2.o        | 16K    | 26K   |
| loops_arithmetic.dispatch.avx512f.o     | 17K    | 27K   |
| loops_arithmetic.dispatch.avx512_skx.o  | 16K    | 26K   |
| loops_arithmetic.dispatch.sse41.o       | 16K    | 26K   |
| loops.o                                 | 1055K  | 1046K |

I will move the operations fmod, remainder, and divmod to a new dispatch-able source file as requested.

@rafaelcfsousa reopened this on Apr 14, 2022
@rafaelcfsousa changed the title from "ENH: Vectorize mod/divide operations using the universal intrinsics (VSX4)" to "ENH,SIMD: Vectorize modulo/divide using the universal intrinsics (VSX4)" on Apr 16, 2022
This commit optimizes the operations below:
 - fmod (signed/unsigned integers)
 - remainder (signed/unsigned integers)
 - divmod (signed/unsigned integers)
 - floor_divide (signed integers)
using the VSX4/Power10 integer vector division/modulo instructions.

See the improvements below (maximum speedup):
 - numpy.fmod
   - arr OP arr:    signed (1.17x), unsigned (1.13x)
   - arr OP scalar: signed (1.34x), unsigned (1.29x)
 - numpy.remainder
   - arr OP arr:    signed (4.19x), unsigned (1.17x)
   - arr OP scalar: signed (4.87x), unsigned (1.29x)
 - numpy.divmod
   - arr OP arr:    signed (4.73x), unsigned (1.23x)
   - arr OP scalar: signed (5.05x), unsigned (1.31x)
 - numpy.floor_divide
   - arr OP arr:    signed (4.44x)

The times above were collected using the benchmark tool available in NumPy.
@rafaelcfsousa
Contributor Author

Hi @seiko2plus, I modified the code as you asked.

With the new changes, the binaries generated for the other architectures now have the same size with and without this PR.

See below the same table I shared in my previous comment, now updated:

| object file                             | before | after |
|-----------------------------------------|--------|-------|
| loops_arithmetic.dispatch.avx2.o        | 16K    | 16K   |
| loops_arithmetic.dispatch.avx512f.o     | 17K    | 17K   |
| loops_arithmetic.dispatch.avx512_skx.o  | 16K    | 16K   |
| loops_arithmetic.dispatch.sse41.o       | 16K    | 16K   |
| loops.o                                 | 1055K  | 1046K |
| loops_modulo.dispatch.o                 | 0      | 11K   |

Thank you! 👍

@mattip merged commit 0eaa40d into numpy:main on Apr 27, 2022
@mattip
Member

mattip commented Apr 27, 2022

Thanks @rafaelcfsousa

@seiko2plus
Member

Thank you @rafaelcfsousa, @mattip. Sorry for the delayed response.

mattip added a commit to mattip/numpy that referenced this pull request on Apr 28, 2022:
ENH,SIMD: Vectorize modulo/divide using the universal intrinsics (VSX4)
Labels: 01 - Enhancement, component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)
4 participants