ENH: Convert tanh from C universal intrinsics to C++ using Highway #25934
Conversation
I'd suggest that the 1-10% variation could also just be noise from a low number of runs. I'll run this on some AArch64 hardware and see what we get. This is very exciting though, and I look forward to reviewing this further 😸
cc @jan-wassenberg because the review box doesn't seem to want me to request review that way 🙀
Nice :) Some suggestions:
meson_cpu/x86/meson.build (Outdated)
@@ -60,7 +60,7 @@ FMA3 = mod_features.new(
    test_code: files(source_root + '/numpy/distutils/checks/cpu_fma3.c')[0]
)
AVX2 = mod_features.new(
-    'AVX2', 25, implies: F16C, args: '-mavx2',
+    'AVX2', 25, implies: F16C, args: '-march=skylake',
Skylake is considerably more than AVX2. To enable HWY_AVX2, -maes -march=haswell are sufficient.
using opmask_t = hn::Mask<decltype(f32)>;

struct hwy_f32x2 {
    vec_f32 val[2];
This is unfortunately not portable to SVE and RISC-V. We can either use Create2 to return a native tuple (caveat: this requires relatively recent compilers, check HWY_HAVE_TUPLE), or we can just use individual in/out function args to pass the two members.
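A scalar sketch of the out-parameter alternative suggested here, with `float` standing in for the vector type (the function name `halve_both` is hypothetical, not from the PR):

```cpp
#include <cassert>

// Sketch: instead of returning a struct holding two vector members --
// not portable to SVE/RISC-V, where sizeless vector types cannot be
// struct members -- the two results are written through individual
// output parameters.
static void halve_both(float in0, float in1, float& out0, float& out1) {
    out0 = in0 * 0.5f;  // first "vector" result
    out1 = in1 * 0.5f;  // second "vector" result
}
```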
if constexpr(hn::Lanes(f64) == 8){
    const vec_f64 lut0 = hn::Load(f64, lut);
    const vec_f64 lut1 = hn::Load(f64, lut + 8);
    return hn::TwoTablesLookupLanes(f64, lut0, lut1, hn::IndicesFromVec(f64, idx));
It might actually be worthwhile to implement an AVX2 version with 4 table lookups and blending, because gather is quite slow again.
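For context, a scalar model of what `TwoTablesLookupLanes` computes per lane — each index selects from the concatenation of two equal-sized tables. On AVX2 (4 doubles per vector), covering a 16-entry LUT this way would take four register-sized tables combined by blending, which is the alternative to gather suggested above. This model is an assumption-laden sketch, not the PR's code:

```cpp
#include <array>
#include <cstddef>

// Scalar model of two-table lookup semantics: idx indexes the
// concatenation lut0 ++ lut1 of two N-entry tables.
template <std::size_t N>
double two_table_lookup(const std::array<double, N>& lut0,
                        const std::array<double, N>& lut1,
                        std::size_t idx) {
    return idx < N ? lut0[idx] : lut1[idx - N];
}
```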
idxs = npyv_min_s32(idxs, npyv_setall_s32(0x3e00000));
npyv_u32 idx = npyv_shri_u32(npyv_reinterpret_u32_s32(idxs), 21);
auto special_m = hn::Le(ndnan, hn::Set(s32, 0x7f000000));
auto nnan_m = hn::Not(hn::IsNaN(x));
We can remove the Not by swapping the arg order in the IfThenElse where this is used.
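The suggestion rests on a simple select identity — choosing with a negated mask is the same as swapping the two value arguments. A scalar stand-in for `hn::IfThenElse` makes this concrete:

```cpp
#include <cassert>

// Scalar stand-in for hn::IfThenElse(mask, yes, no): the identity
// if_then_else(!m, a, b) == if_then_else(m, b, a) is why the hn::Not
// can be dropped by swapping the value arguments.
static int if_then_else(bool mask, int yes, int no) {
    return mask ? yes : no;
}
```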
auto special_m = hn::Le(ndnan, hn::Set(s32, 0x7f000000));
auto nnan_m = hn::Not(hn::IsNaN(x));
vec_s32 idxs = hn::Sub(ndnan, hn::Set(s32, 0x3d400000));
idxs = hn::Max(idxs, hn::Set(s32, 0)); // TODO probably faster zero method
Yes, you can either use hn::Max(idxs, hn::Zero(s32)) or even hn::ZeroIfNegative(idxs).
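A scalar sketch of the two suggested forms — both clamp negative indices to zero and are equivalent to the original `hn::Max(idxs, hn::Set(s32, 0))`:

```cpp
#include <cassert>

// Scalar analogues of the two Highway alternatives for clamping at zero.
static int max_with_zero(int x)    { return x > 0 ? x : 0; }  // like hn::Max(x, hn::Zero(s32))
static int zero_if_negative(int x) { return x < 0 ? 0 : x; }  // like hn::ZeroIfNegative(x)
```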
Oh nice, great to see the AVX2 speedup, congrats!
#include "fast_loop_macros.h"

#if NPY_SIMD_FMA3 && (NPY_SIMD_F32 || NPY_SIMD_F64) // requires native FMA
Highway always has MulAdd, but we should indeed check #if HWY_HAVE_FLOAT64 here. That is only false on Armv7. We can skip the previous checks, I think.
Thanks for this PR!
Force-pushed 8911932 to 55a7f3a
Thanks for the effort here @sterrettm2. I benchmarked this and got some troubling results (benchmark output not preserved in this capture), which seem to indicate that gather/scatter is performing sub-optimally. I'll see if I can figure out why 🤔
@jan-wassenberg this looks like we might have a regression in Highway: rolling back to google/highway@f525867 results in the gather performance improving, whereas with current HEAD it regresses.
Interesting. If I understand correctly, the Apr 12 version is faster than HEAD? The only change to the gather functions since then seems to be google/highway@e20d36d from Apr 15; is that the one that makes the difference in speed?
Just a few suggestions 😸
// index linearly. In order to keep the same vertical calculation, we
// transpose the coef. into lanes. A 4x4 transpose is all that's
// supported so we require `npyv_nlanes_f32` == 4.
#if npyv_nlanes_f32 == 4
Can we replace this with if constexpr(hn::Lanes(f32) == 4 && HWY_TARGET <= HWY_SSE4){ below?
As written, the HWY_TARGET condition means SSE4 or better. Lanes also isn't constexpr for SVE, so I think we'd have to check MaxLanes(f32) <= 4. Maybe we can drop the target check, because avoiding gather would be a win on any platform, right?
I believe on AArch64, the gather was still faster when I benchmarked it.
I've made this change; I changed the target condition to HWY_TARGET >= HWY_SSE4, which I think is correct, so AVX2 and AVX512 don't use the transposed LUTs. It'd be great if someone could double-check that.
One other thing: with the way the code is currently ordered, it's possible that both lookup tables end up in the final object, since they are declared above the if constexpr. I'm inclined to assume the compiler will eliminate the unused one, but I haven't verified this.
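For readers unfamiliar with Highway's target-ordering convention: *better* targets get *smaller* `HWY_TARGET` values, which is why the comparison direction matters here. The constants below are illustrative stand-ins, not Highway's actual values:

```cpp
#include <cassert>

// Illustrative values only: Highway assigns smaller HWY_TARGET values
// to better targets. So `HWY_TARGET <= HWY_SSE4` means "SSE4 or
// better" (including AVX2/AVX-512), while `HWY_TARGET >= HWY_SSE4`
// means "SSE4 or older", i.e. the 128-bit targets that should take
// the transposed-LUT path.
constexpr int kAVX512 = 1, kAVX2 = 2, kSSE4 = 4, kSSE2 = 8;
static bool takes_transposed_lut(int target) { return target >= kSSE4; }
```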
Overall, this seems good; I'll try to bump the highway version once the fix has landed for AArch64.
I've updated the version of highway, so the performance bug on AArch64 should be resolved. I've also done everything suggested in the review; the only thing is, it would be good if someone could check the condition I used for HWY_TARGET in that if constexpr.
#define TANH_TRANSPOSED_LUT
#endif // npyv_nlanes_f64 == 2
HWY_ATTR NPY_FINLINE vec_f64 lut_16_f64(const double * lut, vec_u64 idx){
    if constexpr(hn::Lanes(f64) == 8){
This if constexpr will only work #if !HWY_HAVE_SCALABLE. We can wrap the branches, up to the final } else, in an #if. Then for SVE, the compiler will only see the { GatherIndex } codepath, which is fine.
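A scalar sketch of why this dispatch needs a compile-time lane count: `if constexpr` requires a constant expression, and `hn::Lanes()` is not constexpr on scalable targets (SVE, RVV). Here a template parameter stands in for the fixed lane count, so exactly one LUT strategy survives per instantiation; the helper and strategy names are hypothetical:

```cpp
#include <cstddef>
#include <cstring>

// With a constant lane count, each if constexpr branch is either
// compiled or discarded entirely -- the pattern the review discusses.
template <std::size_t kLanes>
const char* lut_strategy() {
    if constexpr (kLanes == 8) return "two-table lookup";
    else if constexpr (kLanes == 4) return "transposed LUT";
    else return "gather";
}
```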
The reason is: Lanes is only constexpr on some targets, not SVE nor RVV.
Or you can use MaxLanes as we do below.
I think this is fine for now, as we don't use scalable vectors for this routine (yet).
Nice. The HWY_TARGET condition looks good to me: we are selecting all 128-bit targets, and excluding half-length AVX2 or AVX-512.
LGTM. Unsure how to nudge CircleCI; maybe rebase, @sterrettm2?
I've rebased it, and those weird Circle CI failures seem to have vanished.
Thanks @sterrettm2!
This is another patch demonstrating how the current NumPy SIMD code could be converted to Highway, similar to #25781. All tests pass on my local AVX512 and AVX2 machine.
On x86, AVX2 has a major performance improvement, while AVX512 shows a mix of small regressions and speedups.
AVX512
AVX512 stays within 10% of baseline, with most results within 5%.
AVX2
AVX2 shows a major performance improvement; thanks @jan-wassenberg for the idea to avoid slow gathers with the LUTs.