Extract value_type-generic NEON Vectorized<Half> functions to CRTP base class #139084
Conversation
…to CRTP base class This is in preparation for adding NEON Vectorized<BFloat16>, which will be simplified by sharing this code. Differential Revision: [D64997744](https://our.internmc.facebook.com/intern/diff/D64997744/) [ghstack-poisoned]
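For readers skimming the diff, here is a minimal sketch of the CRTP shape this PR introduces. The class and member names are illustrative, not the actual ones in the PR: the idea is that operations depending only on the 16-bit value_type's bit layout are written once in the base class, while each derived Vectorized type supplies its own NEON-specific pieces.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative stand-ins for at::Half / at::BFloat16 storage.
struct Half { uint16_t bits; };
struct BF16 { uint16_t bits; };

// CRTP base: value_type-generic functions are written once here.
template <typename value_type, typename Derived>
class Vectorized16Base {
 public:
  static constexpr std::size_t size() { return 8; }

  // Example of a shared function: negation flips the sign bit, which sits
  // at bit 15 in both fp16 and bf16, so one implementation serves both.
  Derived neg() const {
    Derived out;
    for (std::size_t i = 0; i < size(); ++i) {
      out.lanes_[i].bits =
          static_cast<uint16_t>(derived().lanes_[i].bits ^ 0x8000);
    }
    return out;
  }

 private:
  const Derived& derived() const { return static_cast<const Derived&>(*this); }
};

// Derived classes hold the storage (a NEON register such as float16x8_t in
// the real code) and implement the ops that need type-specific intrinsics.
class VectorizedHalf : public Vectorized16Base<Half, VectorizedHalf> {
  friend class Vectorized16Base<Half, VectorizedHalf>;
  std::array<Half, 8> lanes_{};
};

class VectorizedBFloat16 : public Vectorized16Base<BF16, VectorizedBFloat16> {
  friend class Vectorized16Base<BF16, VectorizedBFloat16>;
  std::array<BF16, 8> lanes_{};
};
```

With this split, the planned Vectorized<BFloat16> only needs to provide the parts that genuinely differ (loads, conversions, arithmetic intrinsics) instead of restating every shared member.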
🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139084
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 unrelated failure) As of commit ecf3cbb with merge base 419a7e1. FLAKY - The following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D64997744
…se class (pytorch#139084) This is in preparation for adding NEON Vectorized<BFloat16>, which will be simplified by sharing this code. Differential Revision: [D64997744](https://our.internmc.facebook.com/intern/diff/D64997744/) Pull Request resolved: pytorch#139084 Approved by: https://github.com/malfet
When we have hardware support, we can use it. When we don't have hardware support, we can still do better than vec_base.h. I'm not sure to what extent we're set up to properly test both `defined(__ARM_FEATURE_BF16)` and `!defined(__ARM_FEATURE_BF16)` builds; feedback is especially welcome there. Testing: vec_test_all_types should cover correctness. For perf, it seems clear that vectorized intrinsics should beat vec_base. Differential Revision: [D64997747](https://our.internmc.facebook.com/intern/diff/D64997747/) Pull Request resolved: pytorch#139090 Approved by: https://github.com/jgong5, https://github.com/malfet ghstack dependencies: pytorch#139084
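A rough sketch of the two paths described above. This is not the PR's actual code; the hardware path assumes the ACLE `vcvt_f32_bf16` intrinsic is available under `__ARM_FEATURE_BF16`, and the fallback relies on bf16 being the top 16 bits of an IEEE-754 float32.

```cpp
#include <cstdint>
#include <cstring>
#ifdef __ARM_FEATURE_BF16
#include <arm_neon.h>
#endif

// Fallback path: bf16 is the top half of an IEEE-754 float32, so a 16-bit
// shift is an exact widening conversion. Applied lane-wise, this is still
// cheaper than routing every lane through the generic code in vec_base.h.
inline float bf16_to_float_fallback(uint16_t b) {
  uint32_t bits = static_cast<uint32_t>(b) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

#ifdef __ARM_FEATURE_BF16
// Hardware path: widen four bf16 lanes to float32 in a single instruction.
inline float32x4_t bf16x4_to_f32x4(bfloat16x4_t v) {
  return vcvt_f32_bf16(v);
}
#endif
```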
…torch#139558) Discovered this bug while working on Vectorized<BFloat16>; apparently we have no automated testing for aarch64 without FP16. Testing: manually disabled the FP16 feature for a local vec_test_all_types run on Mac and saw it pass. Differential Revision: [D65385267](https://our.internmc.facebook.com/intern/diff/D65385267/) Pull Request resolved: pytorch#139558 Approved by: https://github.com/malfet ghstack dependencies: pytorch#139084, pytorch#139090
…rch#139081) Following the previous move of fp16_gemv_trans. Testing: checked for a performance regression with llm_benchmarks' `python benchmarks/benchmark_torch_mm.py llm` and didn't find one. Differential Revision: [D64930872](https://our.internmc.facebook.com/intern/diff/D64930872/) Pull Request resolved: pytorch#139081 Approved by: https://github.com/malfet ghstack dependencies: pytorch#139084, pytorch#139090, pytorch#139558
pytorch#139208) Very similar to pytorch#137917, but for bf16. Differential Revision: [D65155971](https://our.internmc.facebook.com/intern/diff/D65155971/) Pull Request resolved: pytorch#139208 Approved by: https://github.com/malfet ghstack dependencies: pytorch#139084, pytorch#139090, pytorch#139558, pytorch#139081
This is the big milestone for bf16 and should enable us to close pytorch/torchchat#1253. Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-BF16 and observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2. Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/) Pull Request resolved: pytorch#139220 Approved by: https://github.com/malfet ghstack dependencies: pytorch#139084, pytorch#139090, pytorch#139558, pytorch#139081, pytorch#139208
Stack from ghstack (oldest at bottom):
This is in preparation for adding NEON Vectorized<BFloat16>, which will be simplified by sharing this code.
Differential Revision: D64997744
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10