
ENH: Convert tanh from C universal intrinsics to C++ using Highway #25934


Merged: 10 commits, Dec 4, 2024

Conversation

**@sterrettm2** (Contributor) commented Mar 4, 2024:

This is another patch demonstrating how the current NumPy SIMD code could be converted to Highway, similar to #25781. All tests pass on my local AVX512 and AVX2 machines.

On x86, AVX2 has a major performance improvement, while AVX512 shows a mix of small regressions and speedups.

AVX512

AVX512 stays within 10% of baseline, with most benchmarks within 5%.

| Change   | Before [15691c33] <main>   | After [ddbf7be8] <tanh-hwy>   |   Ratio | Benchmark (Parameter)                                                    |
|----------|----------------------------|-------------------------------|---------|--------------------------------------------------------------------------|
| +        | 14.4±0.1μs                 | 15.5±0.01μs                   |    1.08 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'f')        |
| +        | 46.9±0.01μs                | 49.5±2μs                      |    1.06 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'd')        |
| +        | 8.22±0.1μs                 | 8.44±0.05μs                   |    1.03 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 2, 'f')        |
| +        | 893±0.1μs                  | 899±1μs                       |    1.01 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'd') |
| +        | 894±0.7μs                  | 904±0.9μs                     |    1.01 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'd') |
| +        | 882±0.07μs                 | 883±0.5μs                     |    1    | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'd') |
| -        | 178±0.04μs                 | 178±0.2μs                     |    1    | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'f') |
| -        | 17.2±0.01μs                | 17.1±0.01μs                   |    0.99 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'f')        |
| -        | 172±0.2μs                  | 171±0.04μs                    |    0.99 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'f') |
| -        | 167±3μs                    | 165±0.03μs                    |    0.99 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'e') |
| -        | 4.90±0.01μs                | 4.83±0μs                      |    0.98 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'f')        |
| -        | 169±3μs                    | 165±0.06μs                    |    0.98 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'e') |
| -        | 176±6μs                    | 169±0.02μs                    |    0.96 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'e') |
| -        | 19.1±1μs                   | 17.9±0.01μs                   |    0.94 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'd')        |

AVX2

AVX2 shows a major performance improvement; thanks @jan-wassenberg for the idea to avoid slow gathers with the LUTs.

| Change   | Before [15691c33] <main>   | After [0267cd21] <tanh-hwy>   |   Ratio | Benchmark (Parameter)                                                    |
|----------|----------------------------|-------------------------------|---------|--------------------------------------------------------------------------|
| -        | 1.83±0.01ms                | 1.50±0ms                      |    0.82 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'd') |
| -        | 429±0.3μs                  | 351±0.8μs                     |    0.82 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'f') |
| -        | 410±0.2μs                  | 333±2μs                       |    0.81 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 1, 'f') |
| -        | 415±0.2μs                  | 337±0.7μs                     |    0.81 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'f') |
| -        | 1.84±0.01ms                | 1.49±0ms                      |    0.81 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'd') |
| -        | 430±0.3μs                  | 350±0.2μs                     |    0.81 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'f') |
| -        | 1.81±0.01ms                | 1.44±0ms                      |    0.8  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 1, 'd') |
| -        | 1.81±0.01ms                | 1.44±0ms                      |    0.8  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'd') |
| -        | 3.46±0ms                   | 2.22±0ms                      |    0.64 | bench_ufunc.UFunc.time_ufunc_types('tanh')                               |
| -        | 117±0.01μs                 | 43.4±0.04μs                   |    0.37 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'f')        |
| -        | 115±0.04μs                 | 41.2±0.01μs                   |    0.36 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'f')        |
| -        | 498±8μs                    | 149±0.08μs                    |    0.3  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'd')        |
| -        | 499±8μs                    | 150±0.5μs                     |    0.3  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'd')        |
| -        | 475±8μs                    | 110±5μs                       |    0.23 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 2, 'd')        |
| -        | 102±0.02μs                 | 23.1±0.02μs                   |    0.23 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 2, 'f')        |
| -        | 475±9μs                    | 103±0.05μs                    |    0.22 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'd')        |
| -        | 99.7±0.02μs                | 19.7±0.01μs                   |    0.2  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'f')        |

**@Mousius** (Member) commented Mar 5, 2024:

> This is another patch demonstrating how the current NumPy SIMD code could be converted to Highway, similar to #25781. All tests pass on my local AVX512 and AVX2 machine.
>
> AVX512
>
> AVX512 stays within 10% of baseline, most being within 5%.
>
> AVX2
>
> AVX2 has almost exact performance parity.

I'd suggest the 1-10% variation could simply be due to the low absolute times (a few μs) and the code being slightly different. We've seen similar effects before, where merely introducing code leads to a few-μs differences in unrelated code paths.

I'll run this on some AArch64 hardware and see what we get. This is very exciting, though, and I look forward to reviewing it further 😸

**@Mousius** (Member) commented Mar 5, 2024:

cc @jan-wassenberg because the review box doesn't seem to want me to request review that way 🙀

**@jan-wassenberg** (Contributor) left a review:

Nice :) Some suggestions:

```diff
@@ -60,7 +60,7 @@ FMA3 = mod_features.new(
   test_code: files(source_root + '/numpy/distutils/checks/cpu_fma3.c')[0]
 )
 AVX2 = mod_features.new(
-  'AVX2', 25, implies: F16C, args: '-mavx2',
+  'AVX2', 25, implies: F16C, args: '-march=skylake',
```

Skylake is considerably more than AVX2. To enable HWY_AVX2, -maes -march=haswell are sufficient.

```cpp
using opmask_t = hn::Mask<decltype(f32)>;

struct hwy_f32x2 {
    vec_f32 val[2];
```

This is unfortunately not portable to SVE and RISC-V. We can either use Create2 to return a native tuple (caveat: this requires relatively recent compilers, check HWY_HAVE_TUPLE), or we can just use individual in/out function args to pass the two members.

```cpp
if constexpr (hn::Lanes(f64) == 8) {
    const vec_f64 lut0 = hn::Load(f64, lut);
    const vec_f64 lut1 = hn::Load(f64, lut + 8);
    return hn::TwoTablesLookupLanes(f64, lut0, lut1, hn::IndicesFromVec(f64, idx));
```

It might actually be worthwhile to implement an AVX2 version with 4 table lookups and blending, because gather is quite slow again.

```cpp
idxs = npyv_min_s32(idxs, npyv_setall_s32(0x3e00000));
npyv_u32 idx = npyv_shri_u32(npyv_reinterpret_u32_s32(idxs), 21);
auto special_m = hn::Le(ndnan, hn::Set(s32, 0x7f000000));
auto nnan_m = hn::Not(hn::IsNaN(x));
```

We can remove the Not by swapping the arg order in the IfThenElse where this is used.

```cpp
auto special_m = hn::Le(ndnan, hn::Set(s32, 0x7f000000));
auto nnan_m = hn::Not(hn::IsNaN(x));
vec_s32 idxs = hn::Sub(ndnan, hn::Set(s32, 0x3d400000));
idxs = hn::Max(idxs, hn::Set(s32, 0)); // TODO probably faster zero method
```

Yes, you can either use hn::Max(idxs, hn::Zero(s32)) or even hn::ZeroIfNegative(idxs).

**@jan-wassenberg** (Contributor) commented:

Oh nice, great to see the AVX2 speedup, congrats!

```cpp
#include "fast_loop_macros.h"


#if NPY_SIMD_FMA3 && (NPY_SIMD_F32 || NPY_SIMD_F64) // requires native FMA
```

Highway always has `MulAdd`, but we should indeed check `#if HWY_HAVE_FLOAT64` here. That is only false on Armv7. We can skip the previous checks, I think.
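A hedged sketch of what the simplified guard could look like under this suggestion (the kernel body is elided; the only assumption is that the Highway port needs just the double-precision check):

```cpp
// Hypothetical simplified guard: Highway provides MulAdd on every target,
// so only double-precision support needs checking (false only on Armv7).
#if HWY_HAVE_FLOAT64
// ... float64 tanh kernel ...
#endif
```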

**@rgommers** added the `component: SIMD` (Issues in SIMD (fast instruction sets) code or machinery) label on Mar 12, 2024.
**@Rohanjames1997** commented:
Thanks for this PR!
Do you happen to know the perf impact on aarch64?

**@Mousius** (Member) commented Nov 19, 2024:

Thanks for the effort here @sterrettm2. I benchmarked this and got some troubling results:

| Change   | Before [f3cca787] <main>   | After [db31dce1]    |   Ratio | Benchmark (Parameter)                                                             |
|----------|----------------------------|---------------------|---------|-----------------------------------------------------------------------------------|
| +        | 29.6±0.01μs                | 87.0±0.05μs         |    2.94 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'f')                 |
| +        | 29.5±0.02μs                | 86.9±0.04μs         |    2.94 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'f')          |
| +        | 37.4±0.05μs                | 106±1μs             |    2.84 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'f')                 |
| +        | 37.3±0.07μs                | 106±1μs             |    2.83 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'f')          |
| +        | 88.1±0.05μs                | 114±0.4μs           |    1.3  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'd')          |
| +        | 88.4±0.07μs                | 114±0.5μs           |    1.29 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'd')                 |
| +        | 84.0±0.01μs                | 103±0.2μs           |    1.23 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'd')                 |
| +        | 84.1±0.04μs                | 103±0.1μs           |    1.23 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'd')          |

This seems to indicate that gather/scatter is performing sub-optimally. I'll see if I can figure out why 🤔

**@Mousius** (Member) commented Nov 19, 2024:

@jan-wassenberg this looks like we might have a regression in Highway, rolling back to google/highway@f525867 results in the gather performance improving whereas with current HEAD I'm seeing the 3x slower loads.

**@jan-wassenberg** (Contributor) commented:

Interesting. If I understand correctly, the Apr 12 version is faster than HEAD? The only change to the gather functions since then appears to be google/highway@e20d36d from Apr 15; is that the one that makes the difference in speed?

**@Mousius** (Member) commented Nov 19, 2024:

@jan-wassenberg see google/highway#2382

**@Mousius** (Member) left a review:

Just a few suggestions 😸

```cpp
// index linearly. In order to keep the same vertical calculation, we
// transpose the coef. into lanes. A 4x4 transpose is all that's
// supported so we require `npyv_nlanes_f32` == 4.
#if npyv_nlanes_f32 == 4
```

Can we replace this with `if constexpr(hn::Lanes(f32) == 4 && HWY_TARGET <= HWY_SSE4){` below?

**@jan-wassenberg** (Contributor) replied:

As written, the `HWY_TARGET` condition means SSE4 or better. `Lanes` also isn't constexpr for SVE, so I think we'd have to check `MaxLanes(f32) <= 4`. Maybe we can drop the target check, because avoiding gather would be a win on any platform, right?

**@Mousius** (Member) replied:

I believe on AArch64, the gather was still faster when I benchmarked it.

**@sterrettm2** (Contributor, Author) replied Nov 22, 2024:

I've made this change; I changed the target condition to `HWY_TARGET >= HWY_SSE4`, which I think is correct, so AVX2 and AVX512 don't use the transposed LUTs. It'd be great if someone could double-check that.

One other thing: with the way the code is currently ordered, both lookup tables might end up in the final object, since they are defined above the `if constexpr`. I'm inclined to assume the compiler eliminates the unused one, but I haven't verified this.

**@Mousius** (Member) commented Nov 20, 2024:

Overall, this seems good; I'll try to bump the Highway version once the fix has landed for AArch64.

**@sterrettm2** (Contributor, Author) commented:

I've updated the version of Highway, so the performance bug on AArch64 should be resolved. I've also done everything suggested in the review; the only remaining item is that it would be good if someone could check the `HWY_TARGET` condition I used in that `if constexpr`.

@sterrettm2 sterrettm2 requested a review from Mousius November 28, 2024 00:41
```cpp
#define TANH_TRANSPOSED_LUT
#endif // npyv_nlanes_f64 == 2
HWY_ATTR NPY_FINLINE vec_f64 lut_16_f64(const double * lut, vec_u64 idx){
    if constexpr(hn::Lanes(f64) == 8){
```

This `if constexpr` will only work `#if !HWY_HAVE_SCALABLE`. We can wrap them, up to the final `} else`, in an `#if`. Then for SVE, the compiler will only see the `GatherIndex` codepath, which is fine.


The reason is: `Lanes` is only constexpr on some targets, not SVE nor RVV.


Or you can use `MaxLanes` as we do below.

**@Mousius** (Member) replied:

I think this is fine for now, as we don't use scalable vectors for this routine (yet).

**@jan-wassenberg** (Contributor) commented:

> I've updated the version of highway, so the performance bug on AArch64 should be resolved. I've also done everything suggested in the review; the only thing is, it would be good if someone could check the condition I used for HWY_TARGET in that if constexpr.

Nice. The `HWY_TARGET` condition looks good to me; we are selecting all 128-bit targets and excluding half-length AVX2 or AVX-512.

**@Mousius** (Member) left a review:

LGTM. Unsure how to nudge CircleCI; maybe rebase, @sterrettm2?

**@sterrettm2** (Contributor, Author) commented:

I've rebased it, and those odd CircleCI failures seem to have vanished.

**@Mousius** merged commit 1fcd5f8 into numpy:main on Dec 4, 2024. 69 checks passed.
**@Mousius** (Member) commented Dec 4, 2024:

Thanks @sterrettm2!

Labels: 01 - Enhancement; component: SIMD (fast instruction sets) code or machinery