[PyTorch] Migrate bf16 gemv fast path kernel from intrinsics to vec::Vectorized #139159
This pull request was exported from Phabricator. Differential Revision: D65120325
Folding this one into #139081 because I increasingly need large parts of it there anyway.
Very similar to #137912, but for bf16. (This is building toward enabling this fast path on non-ARM architectures, and in particular on x86 for machines without AVX512BF16).
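For readers unfamiliar with what a bf16 gemv computes, here is a minimal scalar sketch, assuming bf16 inputs stored as raw 16-bit values and fp32 accumulation. It is not the kernel from this PR and does not use `vec::Vectorized`; every function name below is hypothetical and chosen only for illustration. It relies on one well-known fact: a bf16 value is the upper 16 bits of an IEEE-754 binary32 float.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the PR): widen raw bf16 bits to fp32.
// bf16 is the top half of a binary32, so widening is a 16-bit left shift.
inline float bf16_to_float(std::uint16_t bits) {
  const std::uint32_t widened = static_cast<std::uint32_t>(bits) << 16;
  float out;
  std::memcpy(&out, &widened, sizeof(out));
  return out;
}

// Hypothetical helper: truncate fp32 to bf16 bits (no rounding).
// Only used here to build small, exactly representable inputs.
inline std::uint16_t float_to_bf16_truncate(float value) {
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return static_cast<std::uint16_t>(bits >> 16);
}

// Scalar reference dot product: bf16 inputs, fp32 accumulator.
inline float bf16_dot_reference(const std::uint16_t* a,
                                const std::uint16_t* b,
                                std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    acc += bf16_to_float(a[i]) * bf16_to_float(b[i]);
  }
  return acc;
}

// Scalar reference gemv: y = A * x, with A stored row-major as m rows of n
// bf16 values. Each output element is one dot product of a row with x.
inline void bf16_gemv_reference(const std::uint16_t* a,
                                const std::uint16_t* x,
                                float* y,
                                std::size_t m,
                                std::size_t n) {
  for (std::size_t row = 0; row < m; ++row) {
    y[row] = bf16_dot_reference(a + row * n, x, n);
  }
}
```

A vectorized fast path would process several elements of each row per iteration instead of one; a scalar reference like this is mainly useful as a correctness baseline when comparing kernel outputs.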
Testing: checked for regression with llm_experiments' benchmarks/benchmark_torch_mm.py llm on M1 Mac and it appeared to be neutral. Supported this assessment by inspecting assembly for the bf16 dot kernel (`objdump -d --no-leading-addr --no-show-raw-insn build/caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.DEFAULT.cpp.o | c++filt` from pytorch root directory after `python setup.py develop`); observed minor instruction scheduling changes but nothing more.

Differential Revision: D65120325
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10