[PyTorch] Migrate bf16 gemv fast path kernel from intrinsics to vec::Vectorized #139159


Closed · wants to merge 7 commits

Conversation

@swolchok (Contributor) commented Oct 29, 2024

Stack from ghstack (oldest at bottom):

Very similar to #137912, but for bf16. (This builds toward enabling this fast path on non-ARM architectures, in particular on x86 machines without AVX512BF16.)

Testing: checked for regression with llm_experiments' `benchmarks/benchmark_torch_mm.py llm` on an M1 Mac; results appeared neutral. Supported this assessment by inspecting the assembly for the bf16 dot kernel (`objdump -d --no-leading-addr --no-show-raw-insn build/caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.DEFAULT.cpp.o | c++filt` from the pytorch root directory after `python setup.py develop`); observed minor instruction-scheduling changes but nothing more.

Differential Revision: D65120325

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot added the module: cpu (CPU specific problem, e.g. perf, algorithm) label Oct 29, 2024

pytorch-bot bot commented Oct 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139159

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 40 Cancelled Jobs

As of commit a78691a with merge base 3e0f4d1 (image):

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D65120325

swolchok added a commit that referenced this pull request Oct 29, 2024




@swolchok added the ciflow/mps (Run MPS tests, subset of trunk) and ciflow/linux-aarch64 (linux aarch64 CI workflow) labels Oct 31, 2024


@swolchok (Contributor, Author) commented Nov 1, 2024

Folding this one into #139081 because I need increasingly large parts of it there anyway.

@swolchok swolchok closed this Nov 1, 2024
@github-actions github-actions bot deleted the gh/swolchok/682/head branch December 2, 2024 02:14
Labels: ciflow/linux-aarch64 (linux aarch64 CI workflow), ciflow/mps (Run MPS tests, subset of trunk), fb-exported, module: cpu (CPU specific problem, e.g. perf, algorithm), topic: not user facing (topic category)
2 participants