
ENH,SIMD: Vectorize modulo/divide using the universal intrinsics (VSX4) #21124


Merged: 1 commit into numpy:main on Apr 27, 2022

Conversation

rafaelcfsousa
Contributor

This PR optimizes the following NumPy operations for VSX4/Power10: (1) divmod, (2) floor_divide, (3) fmod, and (4) remainder.

In summary, the optimization vectorizes the operations listed above using the new integer vector modulo/division instructions available in ISA 3.1 (Power10/VSX4).
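
To make the idea concrete, here is a minimal sketch of what such an inner loop looks like when written directly against the Power10 built-ins. This is illustrative only (the actual PR code goes through NumPy's universal intrinsics and handles division by zero, strides, and tails properly); the helper name and the assumed `vec_mod`/`vec_xl` built-ins and `-mcpu=power10` compile flag are illustrative assumptions, not the merged code:

```c
/*
 * Illustrative sketch only -- not this PR's implementation.
 * Assumes the <altivec.h> built-ins from GCC/Clang with -mcpu=power10;
 * vec_mod is expected to map to the new ISA 3.1 vmodsw instruction.
 * Division by zero, overflow, strides, and the floor/sign fix-up that
 * numpy.remainder/divmod require are all omitted here.
 */
#include <altivec.h>

static void fmod_s32_sketch(const int *a, const int *b, int *out, int len)
{
    int i = 0;
    for (; i + 4 <= len; i += 4) {              /* 4 x int32 per 128-bit vector */
        vector signed int va = vec_xl(0, a + i);
        vector signed int vb = vec_xl(0, b + i);
        vec_xst(vec_mod(va, vb), 0, out + i);
    }
    for (; i < len; ++i) {                      /* scalar tail */
        out[i] = a[i] % b[i];
    }
}
```

Note that numpy.remainder and divmod need an additional floor/sign adjustment on top of the truncated C-style remainder shown here.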

You will notice that I moved the operations listed above (except floor_divide) from loops.c.src to loops_arithmetic.dispatch.c.src, for the following reasons:
(1) to reuse the code of a recent PR (#20976)
(2) to use the CPU feature detection (VSX4)
(3) to use the universal intrinsics

See below the results of benchmarks I ran on a Power10 machine:

numpy.fmod

  • arr OP arr
    • [signed ] speedup of up to 1.17x
    • [unsigned] speedup of up to 1.13x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.01 | 1.03 | 1.06  | 1.07   | 1.07    |
| int16  | 1.04 | 1.04 | 1.10  | 1.12   | 1.12    |
| int32  | 1.03 | 1.04 | 1.11  | 1.13   | 1.13    |
| int64  | 1.02 | 0.98 | 0.96  | 0.96   | 0.96    |
| uint8  | 1.00 | 1.01 | 1.08  | 1.10   | 1.09    |
| uint16 | 1.02 | 1.07 | 1.13  | 1.15   | 1.15    |
| uint32 | 1.01 | 1.07 | 1.15  | 1.16   | 1.17    |
| uint64 | 0.99 | 0.99 | 0.98  | 0.98   | 0.98    |
  • arr OP scalar
    • [signed ] speedup of up to 1.34x
    • [unsigned] speedup of up to 1.29x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.06 | 1.09 | 1.24  | 1.27   | 1.29    |
| int16  | 1.06 | 1.12 | 1.26  | 1.31   | 1.34    |
| int32  | 1.06 | 1.09 | 1.23  | 1.28   | 1.28    |
| int64  | 1.02 | 1.02 | 1.03  | 1.04   | 1.04    |
| uint8  | 1.06 | 1.07 | 1.20  | 1.26   | 1.27    |
| uint16 | 1.05 | 1.08 | 1.21  | 1.28   | 1.29    |
| uint32 | 1.05 | 1.08 | 1.21  | 1.28   | 1.29    |
| uint64 | 1.04 | 1.03 | 1.03  | 1.03   | 1.03    |

numpy.remainder

  • arr OP arr
    • [signed ] speedup of up to 4.19x
    • [unsigned] speedup of up to 1.17x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.09 | 1.28 | 3.43  | 3.96   | 4.06    |
| int16  | 1.07 | 1.44 | 3.77  | 4.13   | 4.19    |
| int32  | 1.03 | 1.18 | 3.47  | 3.83   | 3.85    |
| int64  | 1.05 | 1.08 | 2.78  | 3.01   | 3.00    |
| uint8  | 0.98 | 1.02 | 1.07  | 1.09   | 1.09    |
| uint16 | 1.02 | 1.07 | 1.12  | 1.15   | 1.15    |
| uint32 | 0.99 | 1.06 | 1.15  | 1.16   | 1.17    |
| uint64 | 0.98 | 0.99 | 0.98  | 0.98   | 0.98    |
  • arr OP scalar
    • [signed ] speedup of up to 4.87x
    • [unsigned] speedup of up to 1.29x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.10 | 1.43 | 4.07  | 4.79   | 4.87    |
| int16  | 1.11 | 1.44 | 3.98  | 4.72   | 4.80    |
| int32  | 1.08 | 1.28 | 3.84  | 4.53   | 4.00    |
| int64  | 1.03 | 1.20 | 3.10  | 3.55   | 3.59    |
| uint8  | 1.04 | 1.07 | 1.20  | 1.26   | 1.26    |
| uint16 | 1.04 | 1.07 | 1.21  | 1.28   | 1.29    |
| uint32 | 1.04 | 1.07 | 1.21  | 1.27   | 1.29    |
| uint64 | 1.04 | 1.03 | 1.03  | 1.03   | 1.04    |

numpy.divmod

  • arr OP arr
    • [signed ] speedup of up to 4.73x
    • [unsigned] speedup of up to 1.23x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.05 | 1.41 | 3.70  | 4.65   | 4.73    |
| int16  | 1.04 | 1.50 | 3.64  | 4.16   | 4.24    |
| int32  | 1.03 | 1.28 | 3.42  | 3.92   | 3.92    |
| int64  | 0.95 | 0.88 | 1.60  | 1.66   | 1.66    |
| uint8  | 1.01 | 1.04 | 1.17  | 1.22   | 1.23    |
| uint16 | 1.00 | 1.05 | 1.17  | 1.20   | 1.20    |
| uint32 | 1.01 | 1.05 | 1.16  | 1.20   | 1.19    |
| uint64 | 0.98 | 0.98 | 0.96  | 0.96   | 0.96    |
  • arr OP scalar
    • [signed ] speedup of up to 5.05x
    • [unsigned] speedup of up to 1.31x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.01 | 1.45 | 3.99  | 5.05   | 5.01    |
| int16  | 0.99 | 1.47 | 3.95  | 4.93   | 4.43    |
| int32  | 1.00 | 1.25 | 3.51  | 3.99   | 3.47    |
| int64  | 0.97 | 0.97 | 1.75  | 1.79   | 1.65    |
| uint8  | 1.02 | 1.06 | 1.19  | 1.27   | 1.27    |
| uint16 | 1.01 | 1.07 | 1.22  | 1.31   | 1.27    |
| uint32 | 1.00 | 1.06 | 1.21  | 1.24   | 1.19    |
| uint64 | 0.98 | 1.01 | 1.02  | 1.02   | 1.02    |

numpy.floor_divide

  • arr OP arr
    • [signed ] speedup of up to 4.44x
| dtype  | 100  | 1000 | 10000 | 100000 | 1000000 |
|--------|------|------|-------|--------|---------|
| int8   | 1.01 | 1.20 | 3.59  | 4.30   | 4.39    |
| int16  | 1.02 | 1.97 | 4.06  | 4.37   | 4.44    |
| int32  | 0.98 | 1.62 | 3.49  | 3.79   | 3.82    |
| int64  | 0.89 | 1.02 | 1.63  | 1.66   | 1.66    |

P.S.: There are 3 remaining lint errors in my code; they are in the function signatures.

@mattip added the component: SIMD (Issues in SIMD (fast instruction sets) code or machinery) label on Feb 27, 2022
@seiko2plus (Member) left a comment

The new optimizations should be moved into a separate dispatch-able source that only holds the targets "baseline vsx4", e.g. loops_modulo.dispatch.c.src, to avoid an unnecessary increase of the binary size on x86 for the unaffected targets (sse41, avx2, avx512f, avx512_skx) within loops_arithmetic.dispatch.c.src.
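
For context, a dispatch-able source declares the targets it is built for in a directive at the top of the file; restricting it to the baseline and VSX4 keeps the x86 builds untouched. A rough sketch of such a header (illustrative; the exact directives in the eventual loops_modulo.dispatch.c.src may differ):

```c
/*
 * Rough sketch of a dispatch-able source header (illustrative only).
 * Listing only "baseline vsx4" means the build system compiles this
 * file solely for the baseline and the VSX4 target, so none of the
 * x86 targets end up duplicating its code.
 */
/*@targets
 ** baseline vsx4
 **/
```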

@rafaelcfsousa
Contributor Author

> The new optimizations should be moved into a separate dispatch-able source that only holds the targets "baseline vsx4", e.g. loops_modulo.dispatch.c.src, to avoid an unnecessary increase of the binary size on x86 for the unaffected targets (sse41, avx2, avx512f, avx512_skx) within loops_arithmetic.dispatch.c.src.

Hi @seiko2plus, thank you for reviewing my code.

Below are two points I would like to discuss with you:

  1. One of the operations I optimized in this PR was floor_divide (signed integer types -- array // array). This operation shares some similarities with divmod, which allowed me to reuse the same code to implement/optimize both operations for VSX4. I think that if I move divmod to a new dispatch-able source file, I will have to duplicate a considerable part of that code there.

  2. The optimizations I applied within the dispatch-able source file loops_arithmetic.dispatch.c.src are inside an #ifdef guard (#if defined(NPY_HAVE_VSX4)), which means the compiler's preprocessor removes this code before the binaries for the other targets are generated (see the sketch below).
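
A rough sketch of the guard pattern I mean (illustrative only, not the exact code in this PR):

```c
/*
 * Illustrative only: for any dispatch target that does not define
 * NPY_HAVE_VSX4, the preprocessor drops the whole block, so the x86
 * targets compile exactly the same code they had before this PR.
 */
#if defined(NPY_HAVE_VSX4)
/* VSX4-only helpers using the new vector modulo/divide instructions */
#endif
```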

I would like to know whether, even considering the points above, I should still move the code to another dispatch-able source file. If you think I should, I will apply the suggested changes.

@charris closed this on Apr 6, 2022
@charris reopened this on Apr 6, 2022
@rafaelcfsousa
Contributor Author

Hi, I would like to request a review of my code (cc: @mattip, @seiko2plus).

For this PR specifically, I would like to know whether what I said above makes sense; if not, let me know and I will apply the requested changes.

The CI errors seem to be related to a timeout. The PRs on the first page were closed and reopened (in sequence) last week.

Thanks!

@rafaelcfsousa
Contributor Author

rafaelcfsousa commented Apr 14, 2022

Hi, I analyzed the size of the binaries generated for x86 and, as stated by @seiko2plus, this PR does increase them, since the code moved from loops.c.src is replicated for each of the SIMD extensions defined as targets within loops_arithmetic.dispatch.c.src (see below).

| object file                             | before | after |
|-----------------------------------------|--------|-------|
| loops_arithmetic.dispatch.avx2.o        | 16K    | 26K   |
| loops_arithmetic.dispatch.avx512f.o     | 17K    | 27K   |
| loops_arithmetic.dispatch.avx512_skx.o  | 16K    | 26K   |
| loops_arithmetic.dispatch.sse41.o       | 16K    | 26K   |
| loops.o                                 | 1055K  | 1046K |

I will move the operations fmod, remainder, and divmod to a new dispatch-able source file as requested.

@rafaelcfsousa reopened this on Apr 14, 2022
@rafaelcfsousa changed the title from "ENH: Vectorize mod/divide operations using the universal intrinsics (VSX4)" to "ENH,SIMD: Vectorize modulo/divide using the universal intrinsics (VSX4)" on Apr 16, 2022
This commit optimizes the operations below:
 - fmod (signed/unsigned integers)
 - remainder (signed/unsigned integers)
 - divmod (signed/unsigned integers)
 - floor_divide (signed integers)
using the VSX4/Power10 integer vector division/modulo instructions.

See the improvements below (maximum speedup):
 - numpy.fmod
   - arr OP arr:    signed (1.17x), unsigned (1.13x)
   - arr OP scalar: signed (1.34x), unsigned (1.29x)
 - numpy.remainder
   - arr OP arr:    signed (4.19x), unsigned (1.17x)
   - arr OP scalar: signed (4.87x), unsigned (1.29x)
 - numpy.divmod
   - arr OP arr:    signed (4.73x), unsigned (1.23x)
   - arr OP scalar: signed (5.05x), unsigned (1.31x)
 - numpy.floor_divide
   - arr OP arr:    signed (4.44x)

The times above were collected using the benchmark tool available in NumPy.
@rafaelcfsousa
Contributor Author

Hi @seiko2plus, I modified the code as you asked.

With the new changes, the binaries generated for the other architectures now have the same size with and without this PR.

See below the same table I shared in my previous comment, now updated:

| object file                             | before | after |
|-----------------------------------------|--------|-------|
| loops_arithmetic.dispatch.avx2.o        | 16K    | 16K   |
| loops_arithmetic.dispatch.avx512f.o     | 17K    | 17K   |
| loops_arithmetic.dispatch.avx512_skx.o  | 16K    | 16K   |
| loops_arithmetic.dispatch.sse41.o       | 16K    | 16K   |
| loops.o                                 | 1055K  | 1046K |
| loops_modulo.dispatch.o                 | 0      | 11K   |

Thank you! 👍

@mattip merged commit 0eaa40d into numpy:main on Apr 27, 2022
@mattip
Member

mattip commented Apr 27, 2022

Thanks @rafaelcfsousa

@seiko2plus
Member

Thank you @rafaelcfsousa, @mattip. Sorry for the delayed response.

mattip added a commit to mattip/numpy that referenced this pull request on Apr 28, 2022:
ENH,SIMD: Vectorize modulo/divide using the universal intrinsics (VSX4)
Labels: 01 - Enhancement, component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)
4 participants