
ENH: Convert tanh from C universal intrinsics to C++ using Highway #25934


Merged: 10 commits, Dec 4, 2024

Conversation

**@sterrettm2** (Contributor) commented Mar 4, 2024:

This is another patch demonstrating how the current NumPy SIMD code could be converted to Highway, similar to #25781. All tests pass on my local AVX512 and AVX2 machines.

On x86, AVX2 has a major performance improvement, while AVX512 shows a mix of small regressions and speedups.

AVX512

AVX512 stays within 10% of baseline, with most benchmarks within 5%.

| Change   | Before [15691c33] <main>   | After [ddbf7be8] <tanh-hwy>   |   Ratio | Benchmark (Parameter)                                                    |
|----------|----------------------------|-------------------------------|---------|--------------------------------------------------------------------------|
| +        | 14.4±0.1μs                 | 15.5±0.01μs                   |    1.08 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'f')        |
| +        | 46.9±0.01μs                | 49.5±2μs                      |    1.06 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'd')        |
| +        | 8.22±0.1μs                 | 8.44±0.05μs                   |    1.03 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 2, 'f')        |
| +        | 893±0.1μs                  | 899±1μs                       |    1.01 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'd') |
| +        | 894±0.7μs                  | 904±0.9μs                     |    1.01 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'd') |
| +        | 882±0.07μs                 | 883±0.5μs                     |    1    | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'd') |
| -        | 178±0.04μs                 | 178±0.2μs                     |    1    | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'f') |
| -        | 17.2±0.01μs                | 17.1±0.01μs                   |    0.99 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'f')        |
| -        | 172±0.2μs                  | 171±0.04μs                    |    0.99 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'f') |
| -        | 167±3μs                    | 165±0.03μs                    |    0.99 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'e') |
| -        | 4.90±0.01μs                | 4.83±0μs                      |    0.98 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'f')        |
| -        | 169±3μs                    | 165±0.06μs                    |    0.98 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'e') |
| -        | 176±6μs                    | 169±0.02μs                    |    0.96 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'e') |
| -        | 19.1±1μs                   | 17.9±0.01μs                   |    0.94 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'd')        |

AVX2

AVX2 shows a major performance improvement; thanks @jan-wassenberg for the idea to avoid slow gathers with the LUTs.

| Change   | Before [15691c33] <main>   | After [0267cd21] <tanh-hwy>   |   Ratio | Benchmark (Parameter)                                                    |
|----------|----------------------------|-------------------------------|---------|--------------------------------------------------------------------------|
| -        | 1.83±0.01ms                | 1.50±0ms                      |    0.82 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'd') |
| -        | 429±0.3μs                  | 351±0.8μs                     |    0.82 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'f') |
| -        | 410±0.2μs                  | 333±2μs                       |    0.81 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 1, 'f') |
| -        | 415±0.2μs                  | 337±0.7μs                     |    0.81 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'f') |
| -        | 1.84±0.01ms                | 1.49±0ms                      |    0.81 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'd') |
| -        | 430±0.3μs                  | 350±0.2μs                     |    0.81 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'f') |
| -        | 1.81±0.01ms                | 1.44±0ms                      |    0.8  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 1, 'd') |
| -        | 1.81±0.01ms                | 1.44±0ms                      |    0.8  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'd') |
| -        | 3.46±0ms                   | 2.22±0ms                      |    0.64 | bench_ufunc.UFunc.time_ufunc_types('tanh')                               |
| -        | 117±0.01μs                 | 43.4±0.04μs                   |    0.37 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'f')        |
| -        | 115±0.04μs                 | 41.2±0.01μs                   |    0.36 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'f')        |
| -        | 498±8μs                    | 149±0.08μs                    |    0.3  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'd')        |
| -        | 499±8μs                    | 150±0.5μs                     |    0.3  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'd')        |
| -        | 475±8μs                    | 110±5μs                       |    0.23 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 2, 'd')        |
| -        | 102±0.02μs                 | 23.1±0.02μs                   |    0.23 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 2, 'f')        |
| -        | 475±9μs                    | 103±0.05μs                    |    0.22 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'd')        |
| -        | 99.7±0.02μs                | 19.7±0.01μs                   |    0.2  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'f')        |

**@Mousius** (Member) commented Mar 5, 2024:

> This is another patch demonstrating how the current NumPy SIMD code could be converted to Highway, similar to #25781. All tests pass on my local AVX512 and AVX2 machine.
>
> AVX512
>
> AVX512 stays within 10% of baseline, most being within 5%.
>
> AVX2
>
> AVX2 has almost exact performance parity.

I'd suggest the 1-10% variation could simply be due to the low absolute times (a few μs) and the code being slightly different. We've seen similar effects before, where merely introducing code leads to a few-μs differences in unrelated code paths.

I'll run this on some AArch64 hardware and see what we get. This is very exciting, though, and I look forward to reviewing it further 😸

**@Mousius** (Member) commented Mar 5, 2024:

cc @jan-wassenberg because the review box doesn't seem to want me to request review that way 🙀

**@jan-wassenberg** (Contributor) left a review:

Nice :) Some suggestions:

```diff
@@ -60,7 +60,7 @@ FMA3 = mod_features.new(
   test_code: files(source_root + '/numpy/distutils/checks/cpu_fma3.c')[0]
 )
 AVX2 = mod_features.new(
-  'AVX2', 25, implies: F16C, args: '-mavx2',
+  'AVX2', 25, implies: F16C, args: '-march=skylake',
```

Skylake is considerably more than AVX2. To enable HWY_AVX2, -maes -march=haswell are sufficient.

```cpp
using opmask_t = hn::Mask<decltype(f32)>;

struct hwy_f32x2 {
    vec_f32 val[2];
```

This is unfortunately not portable to SVE and RISC-V. We can either use Create2 to return a native tuple (caveat: this requires relatively recent compilers, check HWY_HAVE_TUPLE), or we can just use individual in/out function args to pass the two members.

```cpp
if constexpr (hn::Lanes(f64) == 8) {
    const vec_f64 lut0 = hn::Load(f64, lut);
    const vec_f64 lut1 = hn::Load(f64, lut + 8);
    return hn::TwoTablesLookupLanes(f64, lut0, lut1, hn::IndicesFromVec(f64, idx));
```

It might actually be worthwhile to implement an AVX2 version with 4 table lookups and blending, because gather is quite slow again.

```cpp
idxs = npyv_min_s32(idxs, npyv_setall_s32(0x3e00000));
npyv_u32 idx = npyv_shri_u32(npyv_reinterpret_u32_s32(idxs), 21);
auto special_m = hn::Le(ndnan, hn::Set(s32, 0x7f000000));
auto nnan_m = hn::Not(hn::IsNaN(x));
```

We can remove the Not by swapping the arg order in the IfThenElse where this is used.

```cpp
auto special_m = hn::Le(ndnan, hn::Set(s32, 0x7f000000));
auto nnan_m = hn::Not(hn::IsNaN(x));
vec_s32 idxs = hn::Sub(ndnan, hn::Set(s32, 0x3d400000));
idxs = hn::Max(idxs, hn::Set(s32, 0)); // TODO probably faster zero method
```

Yes, you can either use hn::Max(idxs, hn::Zero(s32)) or even hn::ZeroIfNegative(idxs).

**@jan-wassenberg** (Contributor) commented:

Oh nice, great to see the AVX2 speedup, congrats!

```cpp
#include "fast_loop_macros.h"


#if NPY_SIMD_FMA3 && (NPY_SIMD_F32 || NPY_SIMD_F64) // requires native FMA
```

Highway always has `MulAdd`, but we should indeed check `#if HWY_HAVE_FLOAT64` here. That is only false on Armv7. We can skip the previous checks, I think.
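A hedged sketch of what the simplified guard could look like under this suggestion (the kernel body is elided; the only assumption is that the Highway port needs just the double-precision check):

```cpp
// Hypothetical simplified guard: Highway provides MulAdd on every target,
// so only double-precision support needs checking (false only on Armv7).
#if HWY_HAVE_FLOAT64
// ... float64 tanh kernel ...
#endif
```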

**@rgommers** added the `component: SIMD` (Issues in SIMD (fast instruction sets) code or machinery) label on Mar 12, 2024.
**@Rohanjames1997** commented:
Thanks for this PR!
Do you happen to know the perf impact on aarch64?

**@Mousius** (Member) commented Nov 19, 2024:

Thanks for the effort here @sterrettm2. I benchmarked this and got some troubling results:

| Change   | Before [f3cca787] <main>   | After [db31dce1]    |   Ratio | Benchmark (Parameter)                                                             |
|----------|----------------------------|---------------------|---------|-----------------------------------------------------------------------------------|
| +        | 29.6±0.01μs                | 87.0±0.05μs         |    2.94 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'f')                 |
| +        | 29.5±0.02μs                | 86.9±0.04μs         |    2.94 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'f')          |
| +        | 37.4±0.05μs                | 106±1μs             |    2.84 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'f')                 |
| +        | 37.3±0.07μs                | 106±1μs             |    2.83 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'f')          |
| +        | 88.1±0.05μs                | 114±0.4μs           |    1.3  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'd')          |
| +        | 88.4±0.07μs                | 114±0.5μs           |    1.29 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'd')                 |
| +        | 84.0±0.01μs                | 103±0.2μs           |    1.23 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'd')                 |
| +        | 84.1±0.04μs                | 103±0.1μs           |    1.23 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'd')          |

This seems to indicate that gather/scatter is performing sub-optimally. I'll see if I can figure out why 🤔

**@Mousius** (Member) commented Nov 19, 2024:

@jan-wassenberg this looks like we might have a regression in Highway, rolling back to google/highway@f525867 results in the gather performance improving whereas with current HEAD I'm seeing the 3x slower loads.

**@jan-wassenberg** (Contributor) commented:

Interesting. If I understand correctly, the Apr 12 version is faster than HEAD? The only change to the gather functions since then appears to be google/highway@e20d36d from Apr 15; is that the one that makes the difference in speed?

**@Mousius** (Member) commented Nov 19, 2024:

@jan-wassenberg see google/highway#2382

**@Mousius** (Member) left a review:

Just a few suggestions 😸

```cpp
// index linearly. In order to keep the same vertical calculation, we
// transpose the coef. into lanes. A 4x4 transpose is all that's
// supported so we require `npyv_nlanes_f32` == 4.
#if npyv_nlanes_f32 == 4
```

Can we replace this with `if constexpr(hn::Lanes(f32) == 4 && HWY_TARGET <= HWY_SSE4){` below?

**@jan-wassenberg** (Contributor) replied:

As written, the `HWY_TARGET` condition means SSE4 or better. `Lanes` also isn't constexpr for SVE, so I think we'd have to check `MaxLanes(f32) <= 4`. Maybe we can drop the target check, because avoiding gather would be a win on any platform, right?

**@Mousius** (Member) replied:

I believe on AArch64, the gather was still faster when I benchmarked it.

**@sterrettm2** (Contributor, Author) replied Nov 22, 2024:

I've made this change; I changed the target condition to `HWY_TARGET >= HWY_SSE4`, which I think is correct, so AVX2 and AVX512 don't use the transposed LUTs. It'd be great if someone could double-check that.

One other thing: with the way the code is currently ordered, both lookup tables might end up in the final object, since they are defined above the `if constexpr`. I'm inclined to assume the compiler eliminates the unused one, but I haven't verified this.

**@Mousius** (Member) commented Nov 20, 2024:

Overall, this seems good; I'll try to bump the Highway version once the fix has landed for AArch64.

**@sterrettm2** (Contributor, Author) commented:

I've updated the version of Highway, so the performance bug on AArch64 should be resolved. I've also done everything suggested in the review; the only remaining item is that it would be good if someone could check the `HWY_TARGET` condition I used in that `if constexpr`.

@sterrettm2 sterrettm2 requested a review from Mousius November 28, 2024 00:41
```cpp
#define TANH_TRANSPOSED_LUT
#endif // npyv_nlanes_f64 == 2
HWY_ATTR NPY_FINLINE vec_f64 lut_16_f64(const double * lut, vec_u64 idx){
    if constexpr(hn::Lanes(f64) == 8){
```

This `if constexpr` will only work `#if !HWY_HAVE_SCALABLE`. We can wrap them, up to the final `} else`, in an `#if`. Then for SVE, the compiler will only see the `GatherIndex` codepath, which is fine.


The reason is: `Lanes` is only constexpr on some targets, not SVE nor RVV.


Or you can use `MaxLanes` as we do below.

**@Mousius** (Member) replied:

I think this is fine for now, as we don't use scalable vectors for this routine (yet).

**@jan-wassenberg** (Contributor) commented:

> I've updated the version of highway, so the performance bug on AArch64 should be resolved. I've also done everything suggested in the review; the only thing is, it would be good if someone could check the condition I used for HWY_TARGET in that if constexpr.

Nice. The `HWY_TARGET` condition looks good to me; we are selecting all 128-bit targets and excluding half-length AVX2 or AVX-512.

**@Mousius** (Member) left a review:

LGTM. Unsure how to nudge CircleCI; maybe rebase, @sterrettm2?

**@sterrettm2** (Contributor, Author) commented:

I've rebased it, and those odd CircleCI failures seem to have vanished.

**@Mousius** merged commit 1fcd5f8 into numpy:main on Dec 4, 2024. 69 checks passed.
**@Mousius** (Member) commented Dec 4, 2024:

Thanks @sterrettm2!

Labels: 01 - Enhancement; component: SIMD (fast instruction sets) code or machinery