Enable explicitly vectorized _weight_int8pack_mm op for FP16 dtype on x86_64 CPU #146777
base: main
Conversation
Merge AVX512 & AVX2 implementations of vectorized int8 WoQ GEMM for CPU, and make it common for Float/BFloat16/Half dtypes.

TODO:
- [ ] Run CI for all devices
- [ ] Check for regressions
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146777. Note: links to docs will display an error until the docs builds have been completed. ❗ There is 1 currently active SEV; if your PR is affected, please view it below. ✅ No failures as of commit 7792c8d with merge base cc444e7. This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM
@pytorchbot merge
Merge failed. Reason: Approvers from one of the following sets are needed:
Hi @malfet, can you please help review & land this PR? Thank you!
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here
Rebase failed due to Command
Raised by https://github.com/pytorch/pytorch/actions/runs/15487849970
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`.
Summary
Currently, _weight_int8pack_mm is only explicitly vectorized for BF16 activations on x86_64 CPU, and it has separate AVX2 & AVX512 implementations. This PR unifies the AVX512 & AVX2 implementations and makes them common to Float/BFloat16/Half activation dtypes, which is feasible since compute & accumulation happen in FP32 even for FP16/BF16 activations.
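For illustration only (this snippet is not from the PR), a minimal sketch of calling the private op with FP16 activations on CPU, assuming the signature `torch._weight_int8pack_mm(activation, int8_weight, scales)` with an `[M, K]` activation, an `[N, K]` int8 weight, and `[N]` per-output-channel scales whose dtype matches the activation:

```python
import torch

# Hypothetical usage sketch, not taken from the PR.
M, K, N = 4, 4096, 4096

a = torch.randn(M, K, dtype=torch.float16)               # FP16 activations
w = torch.randint(-128, 127, (N, K), dtype=torch.int8)   # int8 weight, one row per output channel
scales = torch.rand(N, dtype=torch.float16)              # per-output-channel scales

# Compute & accumulation happen in FP32 internally; the result is cast back
# to the activation dtype.
out = torch._weight_int8pack_mm(a, w, scales)

# FP32 reference for a rough sanity check.
ref = (a.float() @ (w.float() * scales.float().unsqueeze(1)).t()).to(a.dtype)
print(out.shape, torch.allclose(out.float(), ref.float(), atol=1e-2, rtol=1e-2))
```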
Most of the code added in this PR has been copy-pasted from the Inductor-CPP FP32 GEMM micro-kernel template (so, credit to the original authors).
No performance regression was observed. The input shapes (M, N, K) benchmarked were:
[1, 4096, 4096], [1, 4096, 11008], [1, 11008, 4096], [4, 4096, 4096], [4, 4096, 11008], [4, 11008, 4096], [1, 4096, 14336], [1, 14336, 4096], [4, 4096, 14336], [4, 14336, 4096]
Intel OpenMP & tcmalloc were preloaded for benchmarking.
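A rough benchmarking sketch over the shapes above (this is not the PR's actual harness) could use `torch.utils.benchmark`; Intel OpenMP & tcmalloc would be preloaded, e.g. via `LD_PRELOAD`, before launching Python:

```python
import torch
from torch.utils import benchmark

# (M, N, K) shapes from the list above.
shapes = [
    (1, 4096, 4096), (1, 4096, 11008), (1, 11008, 4096),
    (4, 4096, 4096), (4, 4096, 11008), (4, 11008, 4096),
    (1, 4096, 14336), (1, 14336, 4096),
    (4, 4096, 14336), (4, 14336, 4096),
]

for M, N, K in shapes:
    a = torch.randn(M, K, dtype=torch.float16)
    w = torch.randint(-128, 127, (N, K), dtype=torch.int8)
    s = torch.rand(N, dtype=torch.float16)
    timer = benchmark.Timer(
        stmt="torch._weight_int8pack_mm(a, w, s)",
        globals={"torch": torch, "a": a, "w": w, "s": s},
    )
    m = timer.blocked_autorange(min_run_time=1.0)
    print(f"M={M} N={N} K={K}: {m.median * 1e6:.1f} us")
```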
Now the non-vectorized (not explicitly vectorized) micro-kernel would only be used when:
1. `ATEN_CPU_CAPABILITY` is `default` (a quick runtime check is sketched after this list).
2. The build is an MSVC build for x86_64 CPUs.
3. The build is an aarch64 build with `C10_MOBILE` true? Not sure if such builds exist on PyTorch CI.

cc @jgong5 @mingfeima @XiaobingSuper @ashokei @jingxu10 @jerryzh168
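As an aside (assumption: `torch.backends.cpu.get_cpu_capability()` is available, i.e. a recent PyTorch build), a quick way to see which SIMD level the current ATen build dispatches to, and hence whether the explicitly vectorized path would be taken:

```python
import torch

# ATEN_CPU_CAPABILITY=default (set in the environment before importing torch)
# forces the scalar fallback path.
cap = torch.backends.cpu.get_cpu_capability()  # e.g. "AVX512", "AVX2", "DEFAULT"
print("ATen CPU capability:", cap)

if cap in ("AVX2", "AVX512"):
    print("Explicitly vectorized int8 WoQ GEMM micro-kernel is expected to be used.")
else:
    print("Non-vectorized (scalar) micro-kernel fallback.")
```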