ENH: Add SIMD versions of bool logical_&&,||,! and absolute #22167

Developer-Ecosystem-Engineering
Contributor

@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering commented Aug 23, 2022

NumPy has SIMD versions of BOOL logical_and, logical_or, logical_not, and absolute for SSE2. The changes here replace that implementation with one that uses universal intrinsics. This allows other architectures to have SIMD versions of the functions too.

BOOL logical_and and logical_or are particularly important for NumPy as that's how np.any() / np.all() are implemented.
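
To make the link between the reductions and the logical ufuncs concrete, here is a minimal sketch at the Python level. The array and variable names are illustrative and do not come from the PR; it only shows that reducing `logical_or` / `logical_and` gives the same answers as `np.any()` / `np.all()`.

```python
import numpy as np

# Illustrative data only: a boolean mask with a single True at the end.
mask = np.zeros(1_000, dtype=bool)
mask[-1] = True

# Reducing the logical ufuncs over the array ...
any_via_reduce = np.logical_or.reduce(mask)
all_via_reduce = np.logical_and.reduce(mask)

# ... matches what np.any / np.all report for the same data.
print(bool(any_via_reduce), bool(np.any(mask)))
print(bool(all_via_reduce), bool(np.all(mask)))
```

A faster BOOL `logical_or` / `logical_and` loop should therefore benefit both the element-wise calls and these reductions.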

Apple M1: up to 16.5x faster

       before           after         ratio
     [7c143834]       [c49d9dc2]
     <main>           <logical/dev>
-        3.47±0μs      1.82±0.01μs     0.52  bench_reduce.AnyAll.time_all_slow
-     4.05±0.01μs      1.83±0.06μs     0.45  bench_reduce.AnyAll.time_any_slow
-        6.55±0μs          532±2ns     0.08  bench_ufunc.Custom.time_not_bool
-        10.2±0μs          680±7ns     0.07  bench_ufunc.Custom.time_or_bool
-     11.0±0.07μs          665±3ns     0.06  bench_ufunc.Custom.time_and_bool
                                                                                                                                                                    
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Apple M1 (Rosetta): up to 1.6x faster

       before           after         ratio
     [7c143834]       [1ad2f2f2]
     <main>           <logical/dev>
-     1.14±0.01μs         1.03±0μs     0.90  bench_ufunc.Custom.time_not_bool
-     4.38±0.03μs      3.06±0.02μs     0.70  bench_reduce.AnyAll.time_any_slow
-     5.20±0.01μs      3.17±0.01μs     0.61  bench_reduce.AnyAll.time_all_slow

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

iMac Pro (AVX512): up to 1.2x faster

       before           after         ratio
     [da6297b9]       [c49d9dc2]
     <main>           <logical/dev>
-     1.42±0.02μs      1.24±0.03μs     0.87  bench_ufunc.Custom.time_and_bool
-     4.16±0.03μs       3.60±0.1μs     0.86  bench_reduce.AnyAll.time_any_slow
-     1.21±0.03μs      1.03±0.02μs     0.86  bench_ufunc.Custom.time_not_bool
-      4.30±0.1μs      3.56±0.07μs     0.83  bench_reduce.AnyAll.time_all_slow

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@rgommers rgommers added the component: SIMD label Aug 24, 2022
Member

@mattip mattip left a comment

Nice. I will defer to @seiko2plus for the deeper SIMD review. A couple of general questions for all the latest SIMD PRs:

  • Could you check the binary file size change of _multiarray_umath*.so after vs. before?
  • Did the change have an effect on any other benchmarks besides the obvious ones that you reported?
  • If possible, could you run the benchmarks on a non-AVX512 x86_64 system?
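
One possible way to answer the file size question is sketched below. The helper names `extension_sizes` and `size_change` are hypothetical, as are the build directories in the usage note; the sketch only assumes that each build leaves a `_multiarray_umath*.so` file somewhere under a directory you can point it at (on Windows the suffix would differ).

```python
from pathlib import Path


def extension_sizes(root, pattern="_multiarray_umath*.so"):
    """Return {relative path: size in bytes} for files under root matching pattern."""
    root = Path(root)
    return {
        str(path.relative_to(root)): path.stat().st_size
        for path in sorted(root.rglob(pattern))
        if path.is_file()
    }


def size_change(before, after):
    """Return (difference in bytes, after/before ratio) for two byte counts."""
    return after - before, after / before
```

For example, calling `extension_sizes("build-main/numpy")` and `extension_sizes("build-pr/numpy")` (both paths are placeholders) and passing the two byte counts to `size_change` gives the absolute and relative growth of the extension module.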

Member

@seiko2plus seiko2plus left a comment

Sorry for the delayed response. For fair optimization and to avoid two separate implementations for reduction, new any/all universal intrinsics are going to be implemented in PR #22306.

@Developer-Ecosystem-Engineering
Contributor Author

I merged in the suggestions to adopt the changes from #22306 (which didn't exist when this PR was authored).

…nd absolute

NumPy has SIMD versions of BOOL `logical_and`, `logical_or`, `logical_not`, and `absolute` for SSE2. The changes here replace that implementation with one that uses universal intrinsics. This allows other architectures to have SIMD versions of the functions too.

BOOL `logical_and` and `logical_or` are particularly important for NumPy as that's how `np.any()` / `np.all()` are implemented.
Co-authored-by: Sayed Adel <seiko@imavr.com>
@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering force-pushed the add_simd_bool_logical_andornot_absolute branch from d1ff56b to b067a42 Compare December 7, 2022 07:01
Member

@seiko2plus seiko2plus left a comment

Excellent work, just one more nit.

@Developer-Ecosystem-Engineering
Contributor Author

Requested changes implemented, thanks!

Member

@seiko2plus seiko2plus left a comment

My apologies, it seems I wasn't entirely focused. One more thing.

@seiko2plus seiko2plus self-assigned this Dec 8, 2022
@seiko2plus seiko2plus force-pushed the add_simd_bool_logical_andornot_absolute branch 3 times, most recently from 7bb525f to 468a3da Compare December 8, 2022 17:01
Member

@seiko2plus seiko2plus left a comment

LGTM, thank you.
I have made some additional changes to the umath generator to enable runtime dispatching for the following aliases:

BOOL_invert, BOOL_add, BOOL_bitwise_and
BOOL_bitwise_or, BOOL_logical_xor
BOOL_bitwise_xor, BOOL_multiply
BOOL_maximum, BOOL_minimum, BOOL_fmax,
BOOL_fmin
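
A short sketch of why these can be treated as aliases: for boolean inputs, each of the listed ufuncs gives the same result as one of the logical (or comparison) ufuncs. The pairing below is based on observable behaviour, not copied from the generator source, and the array names are illustrative.

```python
import numpy as np

# All four combinations of two boolean operands.
a = np.array([False, False, True, True])
b = np.array([False, True, False, True])

# (result of the listed ufunc, result of the logical ufunc it behaves like)
pairs = {
    "invert": (np.invert(a), np.logical_not(a)),
    "add": (np.add(a, b), np.logical_or(a, b)),
    "bitwise_and": (np.bitwise_and(a, b), np.logical_and(a, b)),
    "bitwise_or": (np.bitwise_or(a, b), np.logical_or(a, b)),
    "logical_xor": (np.logical_xor(a, b), np.not_equal(a, b)),
    "bitwise_xor": (np.bitwise_xor(a, b), np.logical_xor(a, b)),
    "multiply": (np.multiply(a, b), np.logical_and(a, b)),
    "maximum": (np.maximum(a, b), np.logical_or(a, b)),
    "minimum": (np.minimum(a, b), np.logical_and(a, b)),
    "fmax": (np.fmax(a, b), np.logical_or(a, b)),
    "fmin": (np.fmin(a, b), np.logical_and(a, b)),
}

# Names whose two results differ; expected to be empty for bool inputs.
mismatches = [name for name, (got, ref) in pairs.items()
              if not np.array_equal(got, ref)]
print(mismatches)
```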

@seiko2plus seiko2plus force-pushed the add_simd_bool_logical_andornot_absolute branch from 468a3da to c47e5ff Compare December 8, 2022 17:27
  The following functions are defined by the umath generator
  to enable runtime dispatching without the need
  to redefine them within dispatch-able sources:

    BOOL_invert, BOOL_add, BOOL_bitwise_and
    BOOL_bitwise_or, BOOL_logical_xor
    BOOL_bitwise_xor, BOOL_multiply
    BOOL_maximum, BOOL_minimum, BOOL_fmax,
    BOOL_fmin
@seiko2plus seiko2plus force-pushed the add_simd_bool_logical_andornot_absolute branch from c47e5ff to bfa444d Compare December 8, 2022 17:31
@seiko2plus
Member

The last push after the approval was to satisfy the Python linter and to fix the dispatch kwarg in the umath generator for logical_xor; both errors were caused by my latest patch.

@Developer-Ecosystem-Engineering
Contributor Author

@mattip Anything else missing/required?

@mattip mattip merged commit 78a499d into numpy:main Dec 15, 2022
@mattip
Member

mattip commented Dec 15, 2022

@Developer-Ecosystem-Engineering if the benchmarks at the top of the PR are not the latest ones, could you post updated ones here as a footnote to this PR?

@tylerjereddy
Contributor

Note that this showed up as a bit naughty in git bisect from gh-22845.

Labels
01 - Enhancement component: SIMD Issues in SIMD (fast instruction sets) code or machinery