ENH: Vectorize FP16 umath functions using AVX512 #21955

Merged (6 commits, Sep 28, 2022)

Conversation

@r-devulap (Member) commented Jul 8, 2022

This patch leverages the vcvtph2ps/vcvtps2ph instructions and float32 SVML functions to accelerate float16 umath functions. Max ULP error is < 1 for all the math functions.
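
The convert-compute-convert strategy can be illustrated in NumPy itself (a hypothetical sketch, not the PR's C code): upcast float16 to float32, evaluate in single precision, and round back down. The float16 result should then agree with NumPy's own float16 loop to within a ULP or so.

```python
import numpy as np

# Hypothetical sketch of the strategy the patch uses in C: upcast
# float16 inputs to float32, evaluate the function in single precision,
# then round the result back down to float16.
x = np.linspace(-1.0, 1.0, 1000, dtype=np.float16)

via_f32 = np.cos(x.astype(np.float32)).astype(np.float16)  # emulated path
direct = np.cos(x)                                         # NumPy's own loop

# cos on [-1, 1] is positive, so the int16 bit patterns are ordered and
# their difference measures ULPs of the float16 results.
ulp = np.abs(via_f32.view(np.int16).astype(np.int32)
             - direct.view(np.int16).astype(np.int32))
print(int(ulp.max()))
```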

@r-devulap (Member, Author)

It might be useful to add a new CI test to run this new content on Intel SDE.

@rgommers rgommers added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Jul 12, 2022
@seiko2plus (Member)

I have one question before I go further in reviewing this PR: does the SVML implementation use any of the AVX512FP16 instruction set, or does it rely only on single-precision operations? If the latter, then there's no need for it, at least for now; AVX512F with _mm512_cvtph_ps/_mm512_cvtps_ph should provide the same performance.
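
For background on the conversion route (an illustrative NumPy check, not PR code): float16 to float32 conversion is exact for every finite value, so the upcast side of a `_mm512_cvtph_ps` / `_mm512_cvtps_ph` pipeline loses no information.

```python
import numpy as np

# Every finite float16 value is exactly representable in float32, so a
# float16 -> float32 -> float16 round trip is bit-exact. This mirrors what
# _mm512_cvtph_ps / _mm512_cvtps_ph do in hardware, 16 lanes at a time.
bits = np.arange(2**16, dtype=np.uint16)
h = bits.view(np.float16)
finite = np.isfinite(h)                      # skip NaN payload quirks

round_trip = h.astype(np.float32).astype(np.float16)
ok = round_trip.view(np.uint16)[finite] == bits[finite]
print(bool(ok.all()))
```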

@r-devulap r-devulap changed the title ENH: Vectorize FP16 umath functions using AVX512FP16 ISA ENH: Vectorize FP16 umath functions using AVX512 Jul 27, 2022
@r-devulap (Member, Author)

Reworked the patch to work on AVX512. Perf numbers for FP16 functions look great, with a 33x - 65x speedup (on SkylakeX) depending on the function:

       before           after         ratio
     [6e155790]       [901eb7e1]
     <main>           <fp16-umath>
-        1.46±0ms       45.6±0.3μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 1, 'e')
-        1.92±0ms      56.1±0.08μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arccos'>, 1, 1, 'e')
-        1.77±0ms       51.5±0.7μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arcsin'>, 1, 1, 'e')
-        1.49±0ms       42.4±0.1μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 1, 'e')
-     3.38±0.01ms         96.4±1μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arcsinh'>, 1, 1, 'e')
-        2.23±0ms         61.7±1μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log1p'>, 1, 1, 'e')
-        2.15±0ms      58.2±0.09μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arctan'>, 1, 1, 'e')
-     3.19±0.01ms         86.1±1μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arccosh'>, 1, 1, 'e')
-        1.25±0ms       31.2±0.5μs     0.03  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log'>, 1, 1, 'e')
-        1.19±0ms       29.6±0.1μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 1, 1, 'e')
-        1.26±0ms      29.1±0.05μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log2'>, 1, 1, 'e')
-        2.56±0ms       59.3±0.1μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tan'>, 1, 1, 'e')
-     3.45±0.01ms      75.3±0.09μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arctanh'>, 1, 1, 'e')
-        1.18±0ms      24.2±0.06μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 1, 1, 'e')
-        1.62±0ms      30.7±0.05μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log10'>, 1, 1, 'e')
-     3.13±0.01ms         54.5±3μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'e')
-        2.38±0ms       41.0±0.3μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cosh'>, 1, 1, 'e')
-     3.09±0.01ms       49.9±0.2μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sinh'>, 1, 1, 'e')
-        2.26±0ms       35.8±0.1μs     0.02  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'expm1'>, 1, 1, 'e')
-        3.16±0ms       47.2±0.1μs     0.01  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cbrt'>, 1, 1, 'e')
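
For anyone wanting to spot-check locally, here is a rough stand-in for the `bench_ufunc_strides` runs above (a hypothetical sketch; absolute numbers depend entirely on the CPU and on whether the NumPy build enables AVX512 and SVML):

```python
import timeit
import numpy as np

# Contiguous float16 ('e') input and output, matching the benchmark's
# unit-stride case; time one unary ufunc over it.
x = np.ones(10000, dtype=np.float16)
out = np.empty_like(x)

t = timeit.timeit(lambda: np.cos(x, out=out), number=100)
print(f"np.cos, {x.size} float16 elems: {t / 100 * 1e6:.1f} us/call")
```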

@r-devulap (Member, Author)

PR #21954 adds comprehensive test coverage for these math functions. I will rebase this PR once that is merged.

@seiko2plus (Member) left a comment

LGTM, just requires moving the new intrinsics to loops_umath_fp.dispatch.c.src instead.

@r-devulap (Member, Author)

ping ..

@seiko2plus (Member)

@r-devulap, would you please respond to #21955 (comment)? If you disagree, there are a few changes that will need to be made.

const npy_intp ssrc = steps[0] / lsize;
const npy_intp sdst = steps[1] / lsize;
const npy_intp len = dimensions[0];
if ((ssrc == 1) && (sdst == 1)) {
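
For context on the check above (an illustrative NumPy analogy, not the PR's code): `steps` holds byte strides, so dividing by the element size `lsize` gives element strides, and `ssrc == 1 && sdst == 1` selects the unit-stride (contiguous) fast path.

```python
import numpy as np

# Byte strides divided by the element size give element strides,
# the same quantity the C code computes as steps[i] / lsize.
a = np.zeros(16, dtype=np.float16)
print(a.strides[0] // a.itemsize)        # contiguous view: 1
print(a[::4].strides[0] // a.itemsize)   # every 4th element: 4
```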
@seiko2plus (Member)
Checking for memory overlap is still required even with contiguous strides. Also, is there any reason for not supporting non-contiguous memory access?

@r-devulap (Member, Author)

Makes sense; added it. AFAIK there is no gather/scatter instruction for a 16-bit dtype.

@seiko2plus (Member) commented Sep 27, 2022

x86 gather/scatter supports 16-bit offsets; theoretically, it can be emulated via two gather/scatter calls for each full memory access.
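
The two-gather idea can be sketched with NumPy fancy indexing standing in for the hardware gathers (hypothetical; real code would use something like `_mm512_i32gather_epi32` on 32-bit lanes):

```python
import numpy as np

# A 512-bit vector holds 32 float16 lanes, but x86 gather operates on
# 32-bit (or 64-bit) lanes, i.e. 16 per 512-bit register. One strided load
# of 32 float16 elements can therefore be emulated with two 16-lane
# gathers; fancy indexing stands in for the gather instructions here.
src = np.arange(64, dtype=np.float16)
stride = 2
idx = np.arange(32) * stride      # element offsets of the strided access

lo = src[idx[:16]]                # first 16-lane "gather"
hi = src[idx[16:]]                # second 16-lane "gather"
vec = np.concatenate([lo, hi])

print(vec.tolist() == src[::stride].tolist())
```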

@seiko2plus (Member) left a comment

LGTM, thank you.

@mattip mattip merged commit 2dfd21e into numpy:main Sep 28, 2022
@mattip (Member) commented Sep 28, 2022

Maybe this should get a release note? Or should we try to summarize all the SIMD changes in one note for the release?

@mattip (Member) commented Sep 28, 2022

Thanks @r-devulap

@charris (Member) commented Sep 28, 2022

> Maybe this should get a release note?

I just note these as "continuing SIMD improvements" :) A release note wouldn't hurt, as the FP16 improvements are new.

Labels: 01 - Enhancement, component: SIMD