
Enable explicitly vectorized _weight_int8pack_mm op for FP16 dtype on x86_64 CPU #146777


Open · wants to merge 6 commits into base: main

Conversation

@sanchitintel (Collaborator) commented Feb 9, 2025

Summary

Currently, _weight_int8pack_mm is explicitly vectorized only for BF16 activations on x86_64 CPU, and has separate AVX2 & AVX512 implementations.
This PR unifies those AVX512 & AVX2 implementations and makes the kernel common to Float/BFloat16/Half activation dtypes, which is feasible because compute & accumulation happen in FP32 even for FP16/BF16 activations.
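As an aside (not part of this PR's diff), here is a minimal sketch of invoking the op from Python with FP16 activations, assuming the ATen signature _weight_int8pack_mm(activation, int8_weight, scales) with an [M, K] activation, an [N, K] int8 weight, and per-output-channel scales of length N; the exact layout/dtype checks are defined by the kernel itself:

```python
import torch

# Hypothetical shapes for illustration only.
M, N, K = 4, 4096, 4096
x = torch.randn(M, K, dtype=torch.float16)                    # FP16 activation (float32/bfloat16 also supported)
w_int8 = torch.randint(-128, 128, (N, K), dtype=torch.int8)   # int8 weight, one row per output channel
scales = torch.rand(N, dtype=torch.float16)                   # per-output-channel dequantization scales

# Compute & accumulation happen in FP32 inside the kernel; the output dtype follows the activation.
y = torch.ops.aten._weight_int8pack_mm(x, w_int8, scales)
print(y.shape, y.dtype)   # expected: torch.Size([4, 4096]) torch.float16
```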

Most of the code added in this PR has been copied from the Inductor-CPP FP32 GEMM micro-kernel template (credits to the original authors).

No performance regression was observed. The benchmarked input shapes (M, N, K) are:
[1, 4096, 4096], [1, 4096, 11008], [1, 11008, 4096], [4, 4096, 4096], [4, 4096, 11008], [4, 11008, 4096], [1, 4096, 14336], [1, 14336, 4096], [4, 4096, 14336], [4, 14336, 4096]

Intel OpenMP & tcmalloc were preloaded for benchmarking.
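For context, a rough sketch of how these shapes could be timed from Python (this script is not part of the PR; preloading Intel OpenMP & tcmalloc is done via the environment, e.g. LD_PRELOAD, outside this snippet):

```python
import time
import torch

# Shapes from the PR description: (M, N, K)
SHAPES = [(1, 4096, 4096), (1, 4096, 11008), (1, 11008, 4096),
          (4, 4096, 4096), (4, 4096, 11008), (4, 11008, 4096),
          (1, 4096, 14336), (1, 14336, 4096), (4, 4096, 14336), (4, 14336, 4096)]

def bench(m, n, k, dtype=torch.float16, iters=100):
    x = torch.randn(m, k, dtype=dtype)
    w = torch.randint(-128, 128, (n, k), dtype=torch.int8)
    s = torch.rand(n, dtype=dtype)
    for _ in range(10):                                   # warm-up
        torch.ops.aten._weight_int8pack_mm(x, w, s)
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.ops.aten._weight_int8pack_mm(x, w, s)
    return (time.perf_counter() - t0) / iters

for m, n, k in SHAPES:
    print(f"M={m} N={n} K={k}: {bench(m, n, k) * 1e6:.1f} us")
```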

Now the non-vectorized (i.e., not explicitly vectorized) micro-kernel would only be used when:
1. ATEN_CPU_CAPABILITY is default (a quick way to check the active capability is sketched after this list).
2. Building for x86_64 CPUs with MSVC.
3. Building for aarch64 with C10_MOBILE enabled (it is unclear whether such builds exist on PyTorch CI).
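As a hedged illustration (not from the PR), one way to see which CPU capability ATen will dispatch to at runtime; the helper torch.backends.cpu.get_cpu_capability() is assumed to be available in recent PyTorch builds:

```python
import torch

# Reports the highest CPU ISA level ATen dispatch will use on this machine,
# e.g. "AVX512", "AVX2", or "DEFAULT" (the not-explicitly-vectorized path).
print(torch.backends.cpu.get_cpu_capability())

# Starting the process with ATEN_CPU_CAPABILITY=default forces case 1 above
# (the scalar micro-kernel), which is handy for A/B comparisons.
```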

cc @jgong5 @mingfeima @XiaobingSuper @ashokei @jingxu10 @jerryzh168

Merge AVX512 & AVX2 implementations of vectorized int8 WoQ GEMM for CPU, and make it common for Float/BFloat16/Half dtypes.


TODO
- [ ] Run CI for all devices
- [ ] Check for regressions
@sanchitintel added the ciflow/trunk, release notes: performance_as_product, and intel labels Feb 9, 2025
@pytorch-bot bot added the module: cpu label Feb 9, 2025

pytorch-bot bot commented Feb 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146777

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 7792c8d with merge base cc444e7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@sanchitintel sanchitintel changed the title Use vectorized _weight_int8pack_mm on CPU for Half dtype as well Enable vectorized _weight_int8pack_mm op on CPU for FP16 Feb 9, 2025
@sanchitintel sanchitintel marked this pull request as ready for review February 11, 2025 01:36
@albanD added the triaged label Feb 11, 2025
@sanchitintel sanchitintel changed the title Enable vectorized _weight_int8pack_mm op on CPU for FP16 Enable explicitly vectorized _weight_int8pack_mm op on CPU for FP16 Feb 11, 2025
@mingfeima (Collaborator) left a comment

LGTM

@sanchitintel sanchitintel requested a review from malfet February 13, 2025 18:36
@jgong5 jgong5 removed their request for review February 14, 2025 00:54
@sanchitintel sanchitintel changed the title Enable explicitly vectorized _weight_int8pack_mm op on CPU for FP16 Enable explicitly vectorized _weight_int8pack_mm op for FP16 dtype on x86_64 CPU Feb 26, 2025
@sanchitintel (Collaborator, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team · Raised by workflow job

Failing merge rule: Core Maintainers

@sanchitintel (Collaborator, Author)

Hi @malfet, can you please help review & land this PR? Thank you!

@sanchitintel (Collaborator, Author)

@pytorchbot rebase -b main

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot (Collaborator)

Rebase failed: the command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/main pull/146777/head returned non-zero exit code 1

Rebasing (1/5)
Rebasing (2/5)
Auto-merging test/test_linalg.py
CONFLICT (content): Merge conflict in test/test_linalg.py
error: could not apply 9acae478072... Add UT
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 9acae478072... Add UT

Raised by https://github.com/pytorch/pytorch/actions/runs/15487849970

github-actions bot (Contributor) commented Aug 5, 2025

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Aug 5, 2025
Labels
ciflow/trunk · intel · module: cpu · open source · release notes: intel · Stale · triaged
6 participants