ENH: Improve the performance of einsum by using universal simd #17049


Closed · wants to merge 40 commits

Conversation

@Qiyu8 (Member) commented Aug 11, 2020

Rebased from #16641 in order to start a cleaner review. The optimization resulted in a performance improvement of 10% to 77%. Here are the benchmark results:

  • X86 machine

  • AVX512F Enabled

       before           after         ratio
     [00a45b4d]       [47118fb6]
     <master>         <einsum-usimd>
-         149±3μs          104±2μs     0.70  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-         238±6μs          156±5μs     0.65  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-         247±8μs          149±6μs     0.60  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
-        440±10μs         245±10μs     0.56  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
  • AVX2 Enabled
       before           after         ratio
     [00a45b4d]       [47118fb6]
     <master>         <einsum-usimd>
-         154±8μs          106±5μs     0.69  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-         239±6μs          160±4μs     0.67  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-        256±10μs          138±4μs     0.54  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
-         450±5μs          225±4μs     0.50  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
  • SSE2 Enabled
       before           after         ratio
     [00a45b4d]       [47118fb6]
     <master>         <einsum-usimd>
-         145±4μs          107±2μs     0.74  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-         240±2μs          161±6μs     0.67  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-        448±10μs          247±5μs     0.55  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)
-        253±10μs          137±6μs     0.54  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
  • ARM machine

  • NEON Enabled

       before           after         ratio
     [7aced6e5]       [ad0b3b42]
     <master>         <einsum-usimd>
-      22.0±0.7ms       20.0±0.3ms     0.91  bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float64'>)
-       109±0.9μs       99.1±0.8μs     0.91  bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float64'>)
-         111±1μs          100±1μs     0.91  bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float64'>)
-       159±0.8μs          142±2μs     0.90  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float64'>)
-       112±0.6μs         96.3±2μs     0.86  bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float32'>)
-       112±0.6μs       95.2±0.3μs     0.85  bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float32'>)
-       136±0.4μs        115±0.5μs     0.85  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float32'>)
-         518±5μs          363±2μs     0.70  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-        773±20μs         479±10μs     0.62  bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float32'>)
-      17.0±0.4ms       9.77±0.2ms     0.57  bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float32'>)
-         873±4μs          350±4μs     0.40  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)
-         554±2μs          200±2μs     0.36  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-        1.02±0ms          233±4μs     0.23  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Co-author: @seiko2plus

Qiyu8 added 4 commits August 11, 2020 12:10
Merge branch 'einsum-usimd' of github.com:Qiyu8/numpy into einsum-usimd
@Qiyu8 (Member, Author) commented Aug 11, 2020

@seiko2plus Can you add Power8/9 benchmark results here? Feel free to add more benchmark test cases.

Qiyu8 added 3 commits August 12, 2020 10:48
Merge branch 'einsum-usimd' of github.com:Qiyu8/numpy into einsum-usimd
@eric-wieser (Member) commented Aug 12, 2020

> Here is the benchmark result:

What compiler version are you using? I'm a little worried that we're manually optimizing things that a clever compiler can do for us, but we're benchmarking on a compiler that doesn't. At a glance, GCC 10 emits very similar code to this manual vectorization for a simple for loop.
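
For reference, the kind of "simple for loop" meant here is a plain sum-of-products step; a minimal scalar sketch (names are illustrative, the real kernels live in `einsum_sumprod.c.src` and carry more unrolling and templating):

```c
#include <stddef.h>

/* Scalar two-operand sum-of-products step over contiguous data.
 * A modern auto-vectorizer can turn this loop into SSE/AVX code
 * on its own, which is the comparison being asked for. */
static void
sum_of_products_contig_two_scalar(const float *data0, const float *data1,
                                  float *data_out, ptrdiff_t count)
{
    for (ptrdiff_t i = 0; i < count; i++) {
        data_out[i] += data0[i] * data1[i];
    }
}
```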

I think all SIMD benchmarks at a minimum need to state the compiler version for the "before", and should probably include a column with the latest-and-greatest compiler for comparison.

@Qiyu8 (Member, Author) commented Aug 12, 2020

On the x86 platform, I'm using the MSVC compiler (version 14.26.28801) with the /arch and /O2 arguments.
On the ARM platform, I'm using GCC 7.3.0 with the -O3 argument.
I think these compilers already do auto-vectorization, but it is less efficient than manual optimization.

@eric-wieser (Member) commented Aug 12, 2020

It looks like MSVC 16.3 adds auto-vectorized AVX-512 support, so it might be worth trying with that for comparison.
For vectorization to kick in, /fp:fast and/or -ffast-math are required.
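
The fast-math flags matter mainly for reductions: a single-accumulator sum can only be vectorized if the compiler is allowed to reassociate floating-point additions. A hedged sketch of the distinction (illustrative names, not code from this PR):

```c
#include <stddef.h>

/* Without /fp:fast or -ffast-math, the compiler must keep these
 * additions in source order, which blocks vectorizing the reduction. */
static float
sum_scalar(const float *a, ptrdiff_t n)
{
    float acc = 0.0f;
    for (ptrdiff_t i = 0; i < n; i++) {
        acc += a[i];
    }
    return acc;
}

/* Splitting the accumulator by hand makes the reassociation explicit:
 * each partial sum keeps its own addition order, so the compiler can
 * pack acc0..acc3 into one vector register even under strict FP
 * semantics. (The result may differ from sum_scalar in the last bits.) */
static float
sum_scalar_unrolled(const float *a, ptrdiff_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    ptrdiff_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i + 0];
        acc1 += a[i + 1];
        acc2 += a[i + 2];
        acc3 += a[i + 3];
    }
    for (; i < n; i++) {
        acc0 += a[i];  /* scalar tail */
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```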

@Qiyu8 added the 01 - Enhancement, component: numpy.einsum, and component: SIMD labels Aug 13, 2020
@Qiyu8 (Member, Author) commented Aug 13, 2020

I think you mean Visual Studio version 16.3; my Visual Studio version is 16.4, so AVX-512 auto-vectorizer support is available. Here is the benchmark result with /arch:AVX512 enabled:

  • X86 Platform AVX512F enabled
       before           after         ratio
     [00a45b4d]       [ae53e350]
     <master>         <einsum-usimd>
-         144±3μs          101±2μs     0.70  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-         234±6μs          155±2μs     0.66  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-         238±1μs          139±3μs     0.59  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
-         430±4μs          235±8μs     0.55  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@Qiyu8 Qiyu8 requested a review from mattip August 14, 2020 06:47
@mattip (Member) left a comment

LGTM, a few last comments.

Are the benchmark results at the top of the PR current for the last changeset? I think 2e713b0 is the last commit. Does the size of _multiarray_umath.so change?

@seiko2plus (Member) left a comment

rev [1/4]: improve the reduce-sum on x86 (SSE, AVX2)
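
For context on what the reduce-sum in question looks like: NumPy's universal intrinsics expose a horizontal reduce (`npyv_sum_f32`, one of the sum intrinsics this work adds), so a contiguous reduce-to-scalar kernel is roughly the following. This is a sketch assuming `simd/simd.h` inside the NumPy source tree with NPY_SIMD enabled, not the exact code under review:

```c
#include "simd/simd.h"

/* Dot-product-style kernel (contig inputs, stride-0 output):
 * accumulate a[i]*b[i] across lanes, then reduce to one scalar. */
static float
dot_contig_f32(const float *a, const float *b, npy_intp n)
{
    npy_intp i = 0;
    float res = 0.0f;
#if NPY_SIMD
    const int vstep = npyv_nlanes_f32;
    npyv_f32 vacc = npyv_zero_f32();
    for (; i + vstep <= n; i += vstep) {
        npyv_f32 va = npyv_load_f32(a + i);
        npyv_f32 vb = npyv_load_f32(b + i);
        vacc = npyv_muladd_f32(va, vb, vacc);  /* vacc += va * vb */
    }
    res = npyv_sum_f32(vacc);  /* horizontal reduce across lanes */
#endif
    for (; i < n; i++) {
        res += a[i] * b[i];  /* scalar tail */
    }
    return res;
}
```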

seiko2plus added a commit to seiko2plus/numpy that referenced this pull request Aug 19, 2020
  This patch doesn't cause any performance changes,
  it just aims to simplify the review process for numpy#17049,
  according to numpy#17049 (comment)
@eric-wieser
Copy link
Member

eric-wieser commented Aug 21, 2020

This should be good to rebase now. You should be able to delete einsum_sumprod.c.src and replace it with your current einsum.dispatch.c.src (with some tweaks to the includes).

@r-devulap (Member) commented

At first glance at the benchmark numbers, on the x86 platform it seems to me that AVX2 and AVX-512 aren't providing any speedup relative to SSE. Is it worth adding extra code to the library for no benefit?

@Qiyu8 (Member, Author) commented Sep 19, 2020

@mattip some small arrays got a ratio of 1.05 after running multiple times, which I think is caused by normal performance jitter.

@mattip mattip requested a review from eric-wieser September 20, 2020 19:50
@Qiyu8 (Member, Author) commented Sep 28, 2020

@eric-wieser Any other suggestions? Thanks.

Co-authored-by: Eric Wieser <wieser.eric@gmail.com>
@Qiyu8 (Member, Author) commented Oct 13, 2020

@mattip @eric-wieser The last commits are comment and typo fixes, with no impact on performance.

@mattip mattip requested a review from eric-wieser October 13, 2020 06:38
@Qiyu8 added the triage review label Oct 22, 2020
@Qiyu8 (Member, Author) commented Oct 26, 2020

I can split this PR into the following 9 PRs if it's too large to merge (a sketch of one such kernel follows the list):

  1. Add sum intrinsics for all platforms.
  2. Add einsum benchmark cases.
  3. Optimize sum_of_products_contig_two.
  4. Optimize sum_of_products_stride0_contig_outcontig_two.
  5. Optimize sum_of_products_contig_stride0_outcontig_two.
  6. Optimize sum_of_products_contig_contig_outstride0_two.
  7. Optimize sum_of_products_stride0_contig_outstride0_two.
  8. Optimize sum_of_products_contig_stride0_outstride0_two.
  9. Optimize sum_of_products_contig_outstride0_one.
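
As referenced above, here is a sketch of what one of these kernels (item 3, `sum_of_products_contig_two`) might look like with universal intrinsics. Again this assumes `simd/simd.h` inside the NumPy source tree with NPY_SIMD enabled; the exact code in `einsum.dispatch.c.src` may differ:

```c
#include "simd/simd.h"

/* Elementwise multiply-accumulate over contiguous data:
 * out[i] += a[i] * b[i], vector body plus scalar tail. */
static void
sum_of_products_contig_two_f32(const float *a, const float *b,
                               float *out, npy_intp n)
{
    npy_intp i = 0;
#if NPY_SIMD
    const int vstep = npyv_nlanes_f32;
    for (; i + vstep <= n; i += vstep) {
        npyv_f32 vo = npyv_load_f32(out + i);
        vo = npyv_muladd_f32(npyv_load_f32(a + i),
                             npyv_load_f32(b + i), vo);
        npyv_store_f32(out + i, vo);
    }
#endif
    for (; i < n; i++) {
        out[i] += a[i] * b[i];
    }
}
```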

@Qiyu8 Qiyu8 marked this pull request as draft October 30, 2020 03:16
@mattip (Member) commented Nov 4, 2020

PRs to do (1) and (2) have been merged.

@eric-wieser (Member) commented

What's the status of this PR? Did all the components end up as separate PRs? If so, can we link them here then close this?

@Qiyu8 (Member, Author) commented Jan 20, 2021

@eric-wieser The final part is about to come.

@Qiyu8 (Member, Author) commented Jan 21, 2021

Mission Accomplished, closing now.

@Qiyu8 Qiyu8 closed this Jan 21, 2021
Labels
01 - Enhancement · component: numpy.einsum · component: SIMD
5 participants