ENH: Vectorize FP16 umath functions using AVX512 #21955

Merged (6 commits, Sep 28, 2022)

Conversation

@r-devulap (Member) commented Jul 8, 2022

This patch leverages the vcvtph2ps/vcvtps2ph instructions and float32 SVML functions to accelerate float16 umath functions. Max ULP error is < 1 for all the math functions.
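
The convert-compute-convert strategy can be illustrated in NumPy itself (a hypothetical sketch, not the PR's C code): upcast float16 to float32, evaluate in single precision, and round back down. The float16 result should then agree with NumPy's own float16 loop to within a ULP or so.

```python
import numpy as np

# Hypothetical sketch of the strategy the patch uses in C: upcast
# float16 inputs to float32, evaluate the function in single precision,
# then round the result back down to float16.
x = np.linspace(-1.0, 1.0, 1000, dtype=np.float16)

via_f32 = np.cos(x.astype(np.float32)).astype(np.float16)  # emulated path
direct = np.cos(x)                                         # NumPy's own loop

# cos on [-1, 1] is positive, so the int16 bit patterns are ordered and
# their difference measures ULPs of the float16 results.
ulp = np.abs(via_f32.view(np.int16).astype(np.int32)
             - direct.view(np.int16).astype(np.int32))
print(int(ulp.max()))
```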

@r-devulap (Member, Author)

It might be useful to add a new CI test to run this new content on Intel SDE.

@rgommers rgommers added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Jul 12, 2022
@seiko2plus (Member)

I have one question before I go further in reviewing this PR: does the SVML implementation use any of the AVX512FP16 instruction set, or does it rely only on single-precision operations? If the latter, then there's no need for it, at least for now; AVX512F with _mm512_cvtph_ps/_mm512_cvtps_ph should provide the same performance.
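
For background on the conversion route (an illustrative NumPy check, not PR code): float16 to float32 conversion is exact for every finite value, so the upcast side of a `_mm512_cvtph_ps` / `_mm512_cvtps_ph` pipeline loses no information.

```python
import numpy as np

# Every finite float16 value is exactly representable in float32, so a
# float16 -> float32 -> float16 round trip is bit-exact. This mirrors what
# _mm512_cvtph_ps / _mm512_cvtps_ph do in hardware, 16 lanes at a time.
bits = np.arange(2**16, dtype=np.uint16)
h = bits.view(np.float16)
finite = np.isfinite(h)                      # skip NaN payload quirks

round_trip = h.astype(np.float32).astype(np.float16)
ok = round_trip.view(np.uint16)[finite] == bits[finite]
print(bool(ok.all()))
```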

@r-devulap r-devulap changed the title ENH: Vectorize FP16 umath functions using AVX512FP16 ISA ENH: Vectorize FP16 umath functions using AVX512 Jul 27, 2022
@r-devulap (Member, Author)

Reworked the patch to work on AVX512. Perf numbers for FP16 functions look great, with a 33x - 65x speedup (on SkylakeX) depending on the function:

       before           after         ratio
     [6e155790]       [901eb7e1]
     <main>           <fp16-umath>
-        1.46±0ms       45.6±0.3μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 1, 'e')
-        1.92±0ms      56.1±0.08μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arccos'>, 1, 1, 'e')
-        1.77±0ms       51.5±0.7μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arcsin'>, 1, 1, 'e')
-        1.49±0ms       42.4±0.1μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 1, 'e')
-     3.38±0.01ms         96.4±1μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arcsinh'>, 1, 1, 'e')
-        2.23±0ms         61.7±1μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log1p'>, 1, 1, 'e')
-        2.15±0ms      58.2±0.09μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arctan'>, 1, 1, 'e')
-     3.19±0.01ms         86.1±1μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arccosh'>, 1, 1, 'e')
-        1.25±0ms       31.2±0.5μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log'>, 1, 1, 'e')
-        1.19±0ms       29.6±0.1μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 1, 1, 'e')
-        1.26±0ms      29.1±0.05μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log2'>, 1, 1, 'e')
-        2.56±0ms       59.3±0.1μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tan'>, 1, 1, 'e')
-     3.45±0.01ms      75.3±0.09μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arctanh'>, 1, 1, 'e')
-        1.18±0ms      24.2±0.06μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 1, 1, 'e')
-        1.62±0ms      30.7±0.05μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log10'>, 1, 1, 'e')
-     3.13±0.01ms         54.5±3μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'e')
-        2.38±0ms       41.0±0.3μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cosh'>, 1, 1, 'e')
-     3.09±0.01ms       49.9±0.2μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sinh'>, 1, 1, 'e')
-        2.26±0ms       35.8±0.1μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'expm1'>, 1, 1, 'e')
-        3.16±0ms       47.2±0.1μs     0.01  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cbrt'>, 1, 1, 'e')
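
For anyone wanting to spot-check locally, here is a rough stand-in for the `bench_ufunc_strides` runs above (a hypothetical sketch; absolute numbers depend entirely on the CPU and on whether the NumPy build enables AVX512 and SVML):

```python
import timeit
import numpy as np

# Contiguous float16 ('e') input and output, matching the benchmark's
# unit-stride case; time one unary ufunc over it.
x = np.ones(10000, dtype=np.float16)
out = np.empty_like(x)

t = timeit.timeit(lambda: np.cos(x, out=out), number=100)
print(f"np.cos, {x.size} float16 elems: {t / 100 * 1e6:.1f} us/call")
```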

@r-devulap (Member, Author)

PR #21954 adds comprehensive test coverage for these math functions. I will rebase this PR once that is merged.

@seiko2plus (Member) left a comment

LGTM, just requires moving the new intrinsics to loops_umath_fp.dispatch.c.src instead.

@r-devulap (Member, Author)

ping ..

@seiko2plus (Member)

@r-devulap, would you please respond to #21955 (comment)? If you disagree, there are a few changes that will need to be made.

const npy_intp ssrc = steps[0] / lsize;
const npy_intp sdst = steps[1] / lsize;
const npy_intp len = dimensions[0];
if ((ssrc == 1) && (sdst == 1)) {
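
For context on the check above (an illustrative NumPy analogy, not the PR's code): `steps` holds byte strides, so dividing by the element size `lsize` gives element strides, and `ssrc == 1 && sdst == 1` selects the unit-stride (contiguous) fast path.

```python
import numpy as np

# Byte strides divided by the element size give element strides,
# the same quantity the C code computes as steps[i] / lsize.
a = np.zeros(16, dtype=np.float16)
print(a.strides[0] // a.itemsize)        # contiguous view: 1
print(a[::4].strides[0] // a.itemsize)   # every 4th element: 4
```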
@seiko2plus (Member)
Checking for memory overlap is still required even with contiguous strides. Also, is there any reason for not supporting non-contiguous memory access?

@r-devulap (Member, Author)

Makes sense; added it. AFAIK there is no gather/scatter instruction for a 16-bit dtype.

@seiko2plus (Member) commented Sep 27, 2022

x86 gather/scatter supports 16-bit offsets; theoretically, it can be emulated via two gather/scatter calls for each full memory access.
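
The two-gather idea can be sketched with NumPy fancy indexing standing in for the hardware gathers (hypothetical; real code would use something like `_mm512_i32gather_epi32` on 32-bit lanes):

```python
import numpy as np

# A 512-bit vector holds 32 float16 lanes, but x86 gather operates on
# 32-bit (or 64-bit) lanes, i.e. 16 per 512-bit register. One strided load
# of 32 float16 elements can therefore be emulated with two 16-lane
# gathers; fancy indexing stands in for the gather instructions here.
src = np.arange(64, dtype=np.float16)
stride = 2
idx = np.arange(32) * stride      # element offsets of the strided access

lo = src[idx[:16]]                # first 16-lane "gather"
hi = src[idx[16:]]                # second 16-lane "gather"
vec = np.concatenate([lo, hi])

print(vec.tolist() == src[::stride].tolist())
```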

@seiko2plus (Member) left a comment

LGTM, thank you.

@mattip mattip merged commit 2dfd21e into numpy:main Sep 28, 2022
@mattip (Member) commented Sep 28, 2022

Maybe this should get a release note? Or should we try to summarize all the SIMD changes in one note for the release?

@mattip (Member) commented Sep 28, 2022

Thanks @r-devulap

@charris (Member) commented Sep 28, 2022

> Maybe this should get a release note?

I just note these as "continuing SIMD improvements" :) A release note wouldn't hurt, as the FP16 improvements are new.

Labels: 01 - Enhancement, component: SIMD