[PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures #137913

Closed

swolchok wants to merge 10 commits into gh/swolchok/661/base from gh/swolchok/661/head

Contributor

swolchok commented Oct 14, 2024 (edited)

Stack from ghstack (oldest at bottom):

float16_t is ARM-specific. Half is not.

Differential Revision: D64218427

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
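
To illustrate the point about portability, here is a minimal sketch of a signature that mentions only a `Half` type. The `Half` struct, `half_to_float`, and `fp16_dot` below are hypothetical stand-ins written for this example; they are not the code touched by this PR, and the real `c10::Half` has a much larger interface. The idea is only that a declaration written against `Half` compiles on any architecture, whereas one written against `float16_t` requires an ARM toolchain.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for c10::Half: plain 16-bit storage that compiles on
// any architecture. The real class has many more members.
struct Half {
  uint16_t bits;
};

// Software conversion from IEEE 754 binary16 bits to float.
inline float half_to_float(Half h) {
  const uint32_t sign = static_cast<uint32_t>(h.bits & 0x8000u) << 16;
  const uint32_t exponent = (h.bits >> 10) & 0x1Fu;
  uint32_t mantissa = h.bits & 0x3FFu;
  uint32_t out;
  if (exponent == 0x1Fu) {
    out = sign | 0x7F800000u | (mantissa << 13);  // infinity or NaN
  } else if (exponent != 0) {
    out = sign | ((exponent + 112u) << 23) | (mantissa << 13);  // normal
  } else if (mantissa == 0) {
    out = sign;  // signed zero
  } else {
    // Subnormal: shift the leading 1 up to the implicit-bit position.
    uint32_t shift = 0;
    while ((mantissa & 0x400u) == 0) {
      mantissa <<= 1;
      ++shift;
    }
    out = sign | ((113u - shift) << 23) | ((mantissa & 0x3FFu) << 13);
  }
  float result;
  std::memcpy(&result, &out, sizeof(result));
  return result;
}

// Hypothetical portable signature: it mentions only Half, so the declaration
// is valid on x86 as well. An ARM-only implementation could still convert the
// pointers to float16_t internally.
inline float fp16_dot(const Half* a, const Half* b, int64_t n) {
  assert(n >= 0);
  float sum = 0.0f;
  for (int64_t i = 0; i < n; ++i) {
    sum += half_to_float(a[i]) * half_to_float(b[i]);
  }
  return sum;
}
```

The scalar loop is deliberately naive; the sketch is about the signature, not about performance.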


          [PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures

b8dd9d0

float16_t is ARM-specific. Half is not.

Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/)

[ghstack-poisoned]

pytorch-bot bot commented Oct 14, 2024 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137913

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ca2264e with merge base b9618c9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the module: cpu label

Contributor

facebook-github-bot commented Oct 14, 2024

This pull request was exported from Phabricator. Differential Revision: D64218427

facebook-github-bot added the fb-exported label

This was referenced Oct 14, 2024

[PyTorch] Check defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256) instead of defined(CPU_CAPABILITY_NEON) #137722

Closed

[PyTorch] Use 128-bit vectors for ARM64 #137426

Closed

[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert #137661

Closed

[PyTorch] Specialize Vectorized<Half> for NEON even if FP16 arithmetic isn't available #137911

Closed

[PyTorch] Migrate fp16 gemv fast path kernel from intrinsics to vec::Vectorized #137912

Closed

[PyTorch] Move FP16 dot and GEMV kernels to new file in ATen/native/cpu/ #137914

Closed

[PyTorch] Clean up Registers/ElementsPerIteration constants #137915

Closed

[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half #137916

Closed

[PyTorch] Build ReducedPrecisionFloatGemvFastPathKernel & entry points for non-ARM architectures too #137917

Closed

[PyTorch] Hook up fp16_gemv_trans to x86 fp16 GEMM #137918

Closed

Skylion007 approved these changes

pytorch-bot bot added the ciflow/trunk label


          Update on "[PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures"

97a3b11

float16_t is ARM-specific. Half is not.

Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

Contributor

facebook-github-bot commented Oct 15, 2024

This pull request was exported from Phabricator. Differential Revision: D64218427

swolchok mentioned this pull request

[PyTorch] Hook up fp16_gemv_trans to gemv fast path for non-aarch64 architectures #138005

Closed


          Update on "[PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures"

4db795b

float16_t is ARM-specific. Half is not.

Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

Contributor

facebook-github-bot commented Oct 17, 2024

This pull request was exported from Phabricator. Differential Revision: D64218427

swolchok mentioned this pull request

[PyTorch] Support non-zero beta in fp16_gemv_trans #138275

Closed

swolchok requested a review from malfet

October 17, 2024 22:52

swolchok added the topic: not user facing label

malfet reviewed

Contributor

malfet left a comment

Is this to make it accessible on x86? Otherwise float16_t feels like a reasonable type, doesn't it?


          Update on "[PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures"

a8927d3

float16_t is ARM-specific. Half is not.

Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

Contributor

facebook-github-bot commented Oct 22, 2024

This pull request was exported from Phabricator. Differential Revision: D64218427

This was referenced Oct 22, 2024

[PyTorch] Fix inductor CPU masked() body codegen when result dtype is bool and operator is where #138486

Closed

[PyTorch] Fix inductor bug with unrolled vectorized prod #138542

Closed

Contributor

facebook-github-bot commented Oct 24, 2024

This pull request was exported from Phabricator. Differential Revision: D64218427

This was referenced Oct 24, 2024

[PyTorch] Fix ASAN failures for vec_test_all_types Cast test #138716

Closed

[PyTorch] Fix out-of-bounds array access in atomic_add_vec #138744

Closed

swolchok requested a review from malfet

October 25, 2024 19:45

malfet approved these changes


          Update on "[PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures"

a847d4b

float16_t is ARM-specific. Half is not.

Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

Contributor

facebook-github-bot commented Oct 28, 2024

This pull request was exported from Phabricator. Differential Revision: D64218427

This was referenced Oct 28, 2024

Move bf16_gemv_trans to ReducedPrecisionFloatGemvFastPathKernel #139081

Closed

[PyTorch] Add efficient isnan for NEON float #139082

Closed

[PyTorch] Add efficient isnan for NEON half #139083

Closed

Extract value_type-generic NEON Vectorized<Half> functions to CRTP base class #139084

Closed

Add Vectorized<c10::BFloat16> specialization for ARM #139090

Closed

[PyTorch] Migrate bf16 gemv fast path kernel from intrinsics to vec::Vectorized #139159

Closed


          Update on "[PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures"

ca2264e

float16_t is ARM-specific. Half is not.

Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

Contributor

facebook-github-bot commented Oct 29, 2024

This pull request was exported from Phabricator. Differential Revision: D64218427

This was referenced Oct 29, 2024

Build bf16 gemv fast path & entry points for non-ARM architectures too #139208

Closed

Hook up bf16_gemv_trans to x86 bf16 GEMM #139220

Closed

pytorchmergebot added the Merged label

pytorchmergebot closed this in

6502d6c

pytorchmergebot pushed a commit that referenced this pull request


          [PyTorch] Move FP16 dot and GEMV kernels to new file in ATen/native/cpu/ (#137914)

aafbea4

This is in preparation for supporting x86 as well; we need to
be in this directory so that we can get rebuilt with different
CPU_CAPABILITY settings (AVX2/AVX-512). Also incidentally starts
fulfilling request from @malfet to split the ARM64 fast path stuff
into its own file. BFloat16 will be in a later diff.

Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/)

Pull Request resolved: #137914
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913
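
The "rebuilt with different CPU_CAPABILITY settings" idea in the commit message above can be sketched roughly as follows. This is an illustration under assumptions, not the PyTorch build: it assumes the build system compiles the same source file several times, each time passing a different `-DCPU_CAPABILITY=...` value, and that kernels live in an inline namespace named after that macro so the copies do not collide at link time. The `kernels` namespace and `capability_name` function are hypothetical names.

```cpp
#include <cassert>
#include <string>

// Assumption: the build system passes -DCPU_CAPABILITY=AVX2 (or AVX512, ...)
// for each extra compilation of this file. Fall back to DEFAULT otherwise.
#ifndef CPU_CAPABILITY
#define CPU_CAPABILITY DEFAULT
#endif

#define SKETCH_STRINGIFY_IMPL(x) #x
#define SKETCH_STRINGIFY(x) SKETCH_STRINGIFY_IMPL(x)

namespace kernels {
// Each compilation places its symbols in a differently named inline
// namespace, so the per-capability copies can coexist in one binary.
inline namespace CPU_CAPABILITY {

inline std::string capability_name() {
  return SKETCH_STRINGIFY(CPU_CAPABILITY);
}

}  // inline namespace CPU_CAPABILITY
}  // namespace kernels
```

Compiled without any flag, `kernels::capability_name()` returns `"DEFAULT"`; a runtime dispatcher (not shown) would pick among the compiled copies based on what the CPU supports.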

pytorchmergebot pushed a commit that referenced this pull request


          [PyTorch] Clean up Registers/ElementsPerIteration constants (#137915)

5be1556

In preparation for other vector instruction sets. (NEON and AVX512 have 32 registers, but AVX and AVX2 have only 16.)

Differential Revision: [D64265759](https://our.internmc.facebook.com/intern/diff/D64265759/)

Pull Request resolved: #137915
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914
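
As a rough illustration of why register counts matter for such constants, the sketch below derives a per-iteration element count from a register budget. The names `kVectorRegisters` and `elements_per_iteration` are hypothetical and the preprocessor test is an assumption made for this example; the actual constants cleaned up in this commit may be defined quite differently.

```cpp
#include <cassert>

// Register counts as stated in the commit message: 32 for AArch64 NEON and
// AVX-512, 16 for AVX/AVX2. Using 16 as the fallback is an assumption.
#if defined(__aarch64__) || defined(__AVX512F__)
constexpr int kVectorRegisters = 32;
#else
constexpr int kVectorRegisters = 16;
#endif

// Number of elements processed per unrolled loop iteration when
// `registers_used` vector registers of `vector_bytes` bytes each hold
// elements of `element_bytes` bytes.
constexpr int elements_per_iteration(int registers_used,
                                     int vector_bytes,
                                     int element_bytes) {
  return registers_used * (vector_bytes / element_bytes);
}

// Example: 4 accumulators of 128 bits holding 16-bit values.
static_assert(elements_per_iteration(4, 16, 2) == 32, "4 * 8 lanes");
```

A kernel tuned for 32 registers may spill on a 16-register target if it keeps the same unroll factor, which is one reason to make these values configurable per instruction set.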

pytorchmergebot pushed a commit that referenced this pull request


          [PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half (#137916)

fc2d0da

The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.

Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/)

Pull Request resolved: #137916
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915
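
The loop shape described in the commit message above can be sketched as a dot product that processes whole vector-width chunks and leaves the remainder (at most 7 elements for an 8-lane vector) to a scalar loop. This is a simplified illustration, not the kernel from the PR: it uses `float` data and an inner loop over lanes to stand in for one vector operation, and `kLanes` and `dot_with_scalar_tail` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 8 lanes, matching eight 16-bit values in a 128-bit register.
constexpr int64_t kLanes = 8;

inline float dot_with_scalar_tail(const float* a, const float* b, int64_t n) {
  assert(n >= 0);
  float acc[kLanes] = {0.0f};

  // Main loop: whole-register chunks only. The inner loop stands in for a
  // single vector multiply-accumulate.
  const int64_t vec_end = n - (n % kLanes);
  for (int64_t i = 0; i < vec_end; i += kLanes) {
    for (int64_t lane = 0; lane < kLanes; ++lane) {
      acc[lane] += a[i + lane] * b[i + lane];
    }
  }

  // Horizontal reduction of the per-lane accumulators.
  float sum = 0.0f;
  for (int64_t lane = 0; lane < kLanes; ++lane) {
    sum += acc[lane];
  }

  // Scalar fixup for the remaining n % kLanes elements (at most 7).
  for (int64_t i = vec_end; i < n; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

Dropping the half-register step from the tail means the code needs only one vector width, which is presumably what makes a port to other instruction sets simpler.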

pytorchmergebot pushed a commit that referenced this pull request


          [PyTorch] Build ReducedPrecisionFloatGemvFastPathKernel & entry points for non-ARM architectures too (#137917)

b29c170

Remove reasons to gate it on ARM.

Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/)

Pull Request resolved: #137917
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915, #137916

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


          [PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures (pytorch#137913)

bfb12c3

float16_t is ARM-specific. Half is not.

Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/)

Pull Request resolved: pytorch#137913
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


          [PyTorch] Move FP16 dot and GEMV kernels to new file in ATen/native/cpu/ (pytorch#137914)

fd55e52

This is in preparation for supporting x86 as well; we need to
be in this directory so that we can get rebuilt with different
CPU_CAPABILITY settings (AVX2/AVX-512). Also incidentally starts
fulfilling request from @malfet to split the ARM64 fast path stuff
into its own file. BFloat16 will be in a later diff.

Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/)

Pull Request resolved: pytorch#137914
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912, pytorch#137913

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


          [PyTorch] Clean up Registers/ElementsPerIteration constants (pytorch#137915)

e16cfce

In preparation for other vector instruction sets. (NEON and AVX512 have 32 registers, but AVX and AVX2 have only 16.)

Differential Revision: [D64265759](https://our.internmc.facebook.com/intern/diff/D64265759/)

Pull Request resolved: pytorch#137915
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912, pytorch#137913, pytorch#137914

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


          [PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half (pytorch#137916)

The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.

Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/)

Pull Request resolved: pytorch#137916
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912, pytorch#137913, pytorch#137914, pytorch#137915

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


          [PyTorch] Build ReducedPrecisionFloatGemvFastPathKernel & entry points for non-ARM architectures too (pytorch#137917)

Remove reasons to gate it on ARM.

Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/)

Pull Request resolved: pytorch#137917
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912, pytorch#137913, pytorch#137914, pytorch#137915, pytorch#137916

github-actions bot deleted the gh/swolchok/661/head branch

November 29, 2024 02:13

Labels

ciflow/trunk fb-exported Merged module: cpu topic: not user facing