
Hook up bf16_gemv_trans to x86 bf16 GEMM #139220


Closed
wants to merge 14 commits

Conversation

swolchok
Contributor

@swolchok swolchok commented Oct 29, 2024

Stack from ghstack (oldest at bottom):

This is the big milestone for bf16 and should enable us to close pytorch/torchchat#1253.
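
For readers skimming the diff, the "hook up" is essentially a shape check at the top of the x86 bf16 GEMM entry point: if the requested product is really a transposed matrix-vector product (which is what decoder-style token generation tends to produce), route it to the `bf16_gemv_trans` kernel from earlier in this stack instead of falling through to the generic or MKL GEMM. Here is a minimal sketch of that dispatch, using illustrative names and BLAS-style column-major arguments rather than the actual ATen/CPUBlas signatures:

```cpp
// Sketch of the dispatch idea only. Names, signatures, and the bf16_t alias
// are illustrative; they are not the actual ATen/CPUBlas code.
#include <cstdint>

using bf16_t = std::uint16_t;  // raw bfloat16 bits, stand-in for c10::BFloat16

// `gemv_trans` is expected to compute y = alpha * A^T * x + beta * y for an
// m-by-n column-major bf16 matrix A (lda >= m), a bf16 vector x of length m,
// and a float vector y of length n -- i.e. the kernel added earlier in this
// stack (a scalar reference version is sketched after the testing note below).
template <typename GemvTrans>
bool try_bf16_gemv_trans_fast_path(
    GemvTrans&& gemv_trans,
    bool transa, bool transb,
    std::int64_t m, std::int64_t n, std::int64_t k,
    float alpha,
    const bf16_t* a, std::int64_t lda,
    const bf16_t* b, std::int64_t /*ldb*/,
    float beta,
    float* c, std::int64_t /*ldc*/) {
  // BLAS-style GEMM: C (m x n) = op(A) (m x k) * op(B) (k x n).
  // When n == 1 and A is transposed, this is exactly "stored A^T times a
  // vector": A is stored k x m, B degenerates to a k-vector, C to an m-vector.
  if (transa && !transb && n == 1 && lda >= k) {
    gemv_trans(/*rows=*/k, /*cols=*/m, alpha, a, lda,
               /*x=*/b, /*incx=*/std::int64_t{1},
               beta, /*y=*/c, /*incy=*/std::int64_t{1});
    return true;   // handled by the GEMV fast path
  }
  return false;    // fall through to the existing GEMM path (MKL or generic)
}
```

The real change also has to cover the other operand layouts and output dtypes; the point of the sketch is just that the n == 1, transposed-A case gets peeled off before the general GEMM runs.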

Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-BF16; observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to ~7.4 tok/sec on an x86 machine with only AVX2.
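
The AVX2-only speedup is easier to see given that a bf16 GEMV needs no bf16 arithmetic at all: bfloat16 is the upper 16 bits of an IEEE float32, so each element can be widened with a 16-bit shift and the whole reduction done with ordinary fp32 FMAs. Below is a scalar reference version of the transposed-GEMV inner loop; it is illustrative only, and the names (`bf16_t`, `bf16_to_float`, `bf16_gemv_trans_ref`) are not the ones in the stack, whose kernels are vectorized per ISA.

```cpp
// Scalar reference only; the shipped kernel is vectorized per ISA.
#include <cstdint>
#include <cstring>

using bf16_t = std::uint16_t;  // raw bfloat16 bits

// Widen bf16 -> fp32: bfloat16 is the upper 16 bits of a float32, so the
// conversion is just a 16-bit shift into the high half of the word.
inline float bf16_to_float(bf16_t x) {
  std::uint32_t bits = static_cast<std::uint32_t>(x) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// y = alpha * A^T * x + beta * y, with A an m-by-n column-major bf16 matrix
// (lda >= m), x a bf16 vector of length m, and y a float vector of length n.
// Every output element is a dot product of one column of A with x, and the
// reduction is accumulated in fp32 for both accuracy and speed.
void bf16_gemv_trans_ref(std::int64_t m, std::int64_t n, float alpha,
                         const bf16_t* a, std::int64_t lda, const bf16_t* x,
                         std::int64_t incx, float beta, float* y,
                         std::int64_t incy) {
  for (std::int64_t j = 0; j < n; ++j) {
    float acc = 0.0f;
    const bf16_t* col = a + j * lda;
    for (std::int64_t i = 0; i < m; ++i) {
      acc += bf16_to_float(col[i]) * bf16_to_float(x[i * incx]);
    }
    // BLAS convention: don't read y when beta == 0, so it may be uninitialized.
    y[j * incy] = alpha * acc + (beta == 0.0f ? 0.0f : beta * y[j * incy]);
  }
}
```

On machines with AVX512-BF16 the inner dot product can additionally use the native bf16 dot-product instruction, which is consistent with that configuration already keeping pace with the MKL path.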

Differential Revision: D65170967


pytorch-bot bot commented Oct 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139220

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 81ba770 with merge base 419a7e1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D65170967

swolchok added a commit that referenced this pull request Oct 29, 2024

@swolchok swolchok requested a review from malfet October 31, 2024 17:50
swolchok added a commit that referenced this pull request Nov 3, 2024

swolchok added a commit that referenced this pull request Nov 4, 2024
@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 8, 2024
@malfet
Contributor

malfet commented Nov 8, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

zero000064 pushed a commit to zero000064/pytorch that referenced this pull request Nov 14, 2024
Pull Request resolved: pytorch#139220
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#139084, pytorch#139090, pytorch#139558, pytorch#139081, pytorch#139208
Ryo-not-rio pushed a commit to Ryo-not-rio/pytorch that referenced this pull request Dec 2, 2024
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
@github-actions github-actions bot deleted the gh/swolchok/684/head branch December 9, 2024 02:14
fmo-mt pushed a commit to fmo-mt/pytorch that referenced this pull request Dec 11, 2024
Labels
ciflow/trunk · fb-exported · Merged · topic: performance
4 participants