-
-
Notifications
You must be signed in to change notification settings - Fork 11.1k
ENH: Vectorize FP16 umath functions using AVX512 #21955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Might be useful to add a new CI test to run this new content on Intel SDE. |
I have one question before I go further in reviewing this pr. Does the SVML implementation use any of |
Reworked the patch to work on AVX512. Perf numbers for FP16 functions look great with a 33x - 65x speed up (on SkylakeX) depending on the function:
|
PR #21954 adds comprehensive test coverage for these math functions. I will rebase this PR once that is merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just requires moving the new intrinsics to the source loops_umath_fp.dispatch.c.src
instead
ping .. |
@r-devulap, Would you please respond to #21955 (comment)? if you disagree then there're a few changes that will need to be done. |
const npy_intp ssrc = steps[0] / lsize; | ||
const npy_intp sdst = steps[1] / lsize; | ||
const npy_intp len = dimensions[0]; | ||
if ((ssrc == 1) && (sdst == 1)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checking memory overlap is still required even with contiguous strides, also are there any reasons for not supporting non-contiguous memory access?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Added it. AFAIK there is no gather/scatter instruction for 16-bit dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x86 gather/scatter supports 16-bit offset. theoretically, it can be emulated via two gather/scatter calls for each full memory access.
67212ee
to
7dfcd39
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thank you
Maybe this should get a release note? Or should we try to summarize all the SIMD changes in one note for the release? |
Thanks @r-devulap |
I just note these as "continuing SIMD improvements" :) A release note wouldn't hurt as FP16 improvements are new. |
This patch leverages the
vcvtps2ph
,vcvtpd2ps
instructions and float32 SVML functions to accelerate float16 umath functions. Max ULP error < 1 for all the math functions.