[ZENDNN] Integrate ZenDNN library, implement Linear op, add unit-tests #156599

Open · wants to merge 13 commits into main from gh/naveenthangudu/zendnn-ph1-pr1
Conversation

@naveenthangudu commented Jun 23, 2025

🚀 [ZENDNN] Integrate and Optimize zendnn_linear with Fusion and Prepack Support
📌 Summary
This PR introduces the zendnn_linear operator into PyTorch with full support for:

  • Unary and binary post-op fusions (e.g., ReLU, SiLU, GELU, Tanh, Sigmoid)
  • Weight prepacking for performance optimization
  • Extensive unit testing
  • ZenDNN integration with the freezing path of torch.compile (Inductor) and torch.export (AOT-Inductor)
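
For illustration, a hedged sketch of a direct call to the new op. The op name comes from this PR; the torch.ops.aten namespace and the (input, weight, bias) signature are assumptions about how the op would be exposed, not the final API.

```python
import torch

# Hypothetical direct invocation; the namespace and signature are assumptions
# based on how ATen ops are typically exposed, not this PR's final API.
x = torch.randn(8, 64)    # (batch, in_features)
w = torch.randn(32, 64)   # (out_features, in_features)
b = torch.randn(32)

y = torch.ops.aten.zendnn_linear(x, w, b)  # assumed signature
print(y.shape)  # expected: torch.Size([8, 32])
```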

🔧 Key Features

  1. ZenDNN Linear Operator Integration
    • Implemented the zendnn_linear op
    • Registered it in PyTorch with meta and shim support
    • Integrated it into native_functions.yaml and the AOT-Inductor backend
  2. Fusion Support
    • Enabled unary and binary post-op fusions
    • Implemented the linear_unary_binary fusion op
    • Added meta and shim functions for fusion support
  3. Weight Prepacking
    • Introduced the zendnn_weight_prepack_for_linear op
    • Added a graph pass to insert the prepack op into the AOT-Inductor graph
    • Enabled via the zendnn.optimize() API when weight_prepack=True (see the sketch after this list)
    • Validated via unit tests and graph inspection
  4. Testing
    • Added unit tests for zendnn_linear and its fusions
    • Added model-level tests for export and accuracy validation
    • Cleaned up redundant comments in test files
  5. Infrastructure Enhancements
    • Added ZenDNN as an optional third-party library
    • Introduced the USE_ZENDNN CMake flag
    • Added the Python API torch._C.has_zendnn() for runtime checks
    • Added Zen4 CPU detection and compatibility validation
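
To make items 3 and 5 concrete, here is a minimal usage sketch. It only uses names mentioned in this PR (torch._C.has_zendnn(), torch._C._cpu._is_amd_zen4_or_newer(), zendnn.optimize(), weight_prepack); the module path of the optimize API is an assumption, not the final interface.

```python
import torch

# Hedged sketch of the runtime gating this PR describes; has_zendnn() and
# _is_amd_zen4_or_newer() are the names added by the PR, while the module
# path of zendnn.optimize() is a placeholder assumption.
def zendnn_usable() -> bool:
    return (
        torch._C.has_zendnn()                      # library compiled in (USE_ZENDNN=1)
        and torch._C._cpu._is_amd_zen4_or_newer()  # Zen4+ CPU (e.g., EPYC Genoa)
    )

if zendnn_usable():
    # On Inductor's freezing path, the PR enables prepacking via
    # zendnn.optimize() when weight_prepack=True (module path assumed):
    # zendnn.optimize(weight_prepack=True)
    pass
```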

👥 Authors and Contributors
Naveen Kumar T — NAVEEN.THANGUDU@amd.com
Ankit Jaiswal — ankit.jaiswal@amd.com
Mrigank Srivastava — Mrigank.Srivastava@amd.com
Priyansh Jain — priyansh.jain2@amd.com
Dinesh Mareedu — dinesh.mareedu@amd.com
Harshal Adhav — harshal.adhav@amd.com
Charan Ponnada — charan.ponnada@amd.com
Chinmay Kulkarni — Chinmay.Kulkarni@amd.com
Aakar Dwivedi — aakar.dwivedi@amd.com

RFC

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela

pytorch-bot (bot) commented Jun 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156599

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7de90a2 with merge base 9708fcf:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla (bot) commented Jun 23, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@malfet (Contributor) left a comment:
Why does this need to be in core? torch.ops.aten.zendnn_linear can very easily be added as an extension

@mikaylagawarecki added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Jun 26, 2025
A Contributor left a review comment on this excerpt from the diff:

```cpp
#include <ATen/native/zendnn/Linear_utils.hpp>
#if !AT_ZENDNN_ENABLED()
namespace at::native {
at::Tensor zendnn_linear(const at::Tensor &input, const at::Tensor &weight,
```
Can you please point me to where the weight is prepacked? If it is done in ZenDNN, I would suggest we expose the API to zentorch as well, and then have an API to unpack.

Once we have pack/unpack APIs for the weight, please add the unit tests in this PR as well.

naveenthangudu (Author) replied:

> Can you please point me to where the weight is prepacked? If it is done in ZenDNN, I would suggest we expose the API to zentorch as well, and then have an API to unpack.
>
> Once we have pack/unpack APIs for the weight, please add the unit tests in this PR as well.

We are going to add a linear weight prepack API in the revised PR and will add unit tests for the same. We will get back to you on the unpack op.

We already support weight prepacking for quantized weights in zentorch. Please refer to the links provided for more information.

https://github.com/amd/ZenDNN-pytorch-plugin/blob/2446996647fa950ccf56e53d827a1cb31f6e4109/src/cpu/cpp/WeightReorder.cpp#L23

https://github.com/amd/ZenDNN-pytorch-plugin/blob/main/src/cpu/python/zentorch/_StaticQuantizedLinear.py#L81

cc: @amukho

@naveenthangudu force-pushed the gh/naveenthangudu/zendnn-ph1-pr1 branch from b7eed5e to d84394b on July 21, 2025 at 19:17
@naveenthangudu requested review from a team and albanD as code owners on July 21, 2025 at 19:17
@albanD (Collaborator) commented Jul 21, 2025

Just to make sure you're not spending too much time on a PR that is not going to be merged.
We do not accept large code-drop PRs in the repo. Every PR needs to be a self contained piece that can be easily reviewed and merged.
Please also make sure to address all the different concerns on the issue and get the green light there to ensure your plan is aligned with maintainers.

@naveenthangudu (Author) replied:

> Just to make sure you're not spending too much time on a PR that is not going to be merged. We do not accept large code-drop PRs in the repo. Every PR needs to be a self contained piece that can be easily reviewed and merged. Please also make sure to address all the different concerns on the issue and get the green light there to ensure your plan is aligned with maintainers.

I've created this PR to demonstrate potential perf improvements with ZenDNN linear ops on EPYC servers. As suggested, I'll split it into smaller, self-contained PRs. Note that the performance improvements depend on fusions, so gains will become evident only after the first few PRs are merged.

@naveenthangudu force-pushed the gh/naveenthangudu/zendnn-ph1-pr1 branch from d84394b to 779b34b on August 5, 2025 at 13:35
@naveenthangudu (Author) commented Aug 6, 2025

With this POC PR, the impact on binary size is as follows:

| Torch binary | Size without ZenDNN | Size with ZenDNN | Size increase | Percentage increase |
| --- | --- | --- | --- | --- |
| Torch wheel file | 192 MB | 197 MB | 5 MB | 2.6% |
| Torch CPU library (libtorch_cpu.so) | 432 MB | 446 MB | 14 MB | 3.24% |

FP32 perf data on Genoa machines is as follows:

| Suite | Pass rate | ZenDNN vs Inductor (geomean) |
| --- | --- | --- |
| TorchBench | 76/80 | 1.04 |
| HuggingFace | 42/45 | 1.43 |
| TIMM | 57/61 | 1.00 |

We will generate and add BF16 data too.

We will also work to reproduce these numbers on the Inductor perf dashboard.

@jithunnair-amd added the ciflow/inductor-perf-test-nightly-x86-zen label (Trigger inductor perf tests on Zen x86 CPUs) on Aug 6, 2025
@naveenthangudu force-pushed the gh/naveenthangudu/zendnn-ph1-pr1 branch from 779b34b to 5c1f88b on August 7, 2025 at 16:55
@pytorch-bot (bot) removed the ciflow/inductor-perf-test-nightly-x86-zen label (Trigger inductor perf tests on Zen x86 CPUs) on Aug 7, 2025
@amukho (Contributor) commented Aug 8, 2025

Hi @malfet, can you please add the ciflow/inductor-perf-test-nightly-x86-zen label to this PR to enable triggering the inductor perf dashboard upload with the changes in this PR?

naveenthangudu and others added 6 commits August 12, 2025 10:39
- Add ZenDNN as a third-party library and link it to ATen
- Add ZenDNN (caffe2::zendnn) as a caffe2 dependency
- Introduce USE_ZENDNN CMake ENV option
    - default OFF
    - can be enabled by the user; valid only on x86_64 hosts
- Provide torch._C.has_zendnn() to query availability from Python
- Define AT_ZENDNN_ENABLED macro to gate C/C++ code
- Add ZenDNN as a dependency of the ATen library
- Add ZenDNN as a submodule
- Extend the CMake summary to report ZenDNN build status
- Add the USE_ZENDNN setting and its status to the build settings string in __config__.show()
- Add variable substitutions for the bazel build
- Modify zendnn inductor perf builds with the "zen" keyword
- Set USE_ZENDNN to 1 for *zen* build environments

Co-authored-by: Dinesh Mareedu <Dinesh.Mareedu@amd.com>
Co-authored-by: Aakar Dwivedi <aakar.dwivedi@amd.com>
Change-Id: I8fdd93e11384d3550557f163faf34a7b8f18a6a9
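
As a quick illustration of the build flag and runtime query described in this commit, a hedged verification sketch (assuming a from-source build of this PR on an x86_64 host):

```python
# Hypothetical verification flow; assumes PyTorch was built from this PR
# with the new flag enabled, e.g.:
#   USE_ZENDNN=1 python setup.py develop
import torch

print(torch.__config__.show())   # build settings string now reports USE_ZENDNN
print(torch._C.has_zendnn())     # True when ZenDNN was compiled in
```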
- Implement the linear op

Co-authored-by: Chinmay Kulkarni <Chinmay.Kulkarni@amd.com>
Co-authored-by: Harshal Adhav <harshal.adhav@amd.com>
Change-Id: Ie62ea7c102ec7d48280f8858695a92137ebca04c
Change-Id: Iedd2988244b35d44f596dd427f2a5378154143ee
-- Implement zendnn_weight_prepack_for_linear op to
   prepack weight into zendnn optimized blocked format
-- Update zendnn_linear op to support prepacked weight
-- Add unit tests to validate weight prepacking with
   zendnn_linear

Change-Id: Ie1e2a2bb3561eb5f8f4dc64431e2c7a9ab2d434d
-- zendnn_linear and weight prepack support is registered
-- Added in native_functions.yaml
-- Added fake tensor support via meta registrations
-- Added shim file for AOT-Inductor support

Change-Id: I8ecc4685d666dc7ff4364d13ed8dd48d02d98afe
- Add required infra.
- Add optimize in joint_graph_passes.

Co-authored-by: Dinesh Mareedu <Dinesh.Mareedu@amd.com>
Co-authored-by: Charan Ponnada <charan.ponnada@amd.com>
Change-Id: I371d50ca958e75bb048b60ab7521affa39617ae6
charan-ponnada and others added 7 commits August 12, 2025 10:49
- Added AMD Zen4 detection function
- Added Python binding for is_amd_zen4()
- Check ZenDNN availability (torch._C.has_zendnn), user configuration (USE_ZENDNN env var), and CPU compatibility (torch._C._cpu._is_amd_zen4_or_newer())

Change-Id: Ia57243efdd92a4966b38ae2de715d45e87b9579f
-- Register graph replacement patterns to add
   weight prepack op into aot inductor graph
-- Enable weight prepack optimization through
   zendnn's optimize api when inductor config
   for weight_prepack is True with inductor's
   freezing path
-- Add tests to validate accuracy when the prepack
   op is inserted, as well as to validate that the
   weight prepack op is inserted into the graph

Change-Id: Ib4722e816cf19d6dadddc0a060067eca698d1063
Change-Id: If9cc90f54cb584430fc1d740594bb7292092fc97
- Enable linear unary and binary fusions
- Implement linear_unary_binary fusion op

Change-Id: Ifa4c1a4f2711f549962b0392d1edbfbc84e713ca
-- Added shim and meta functions for unary and binary post-op fusions

Change-Id: I517963d8021890be50e47abdd041bd67f6ccd785
- Add relu, silu, gelu, tanh and sigmoid fusions with zendnn_linear.

Co-authored-by: priyansh jain <priyansh.jain2@amd.com>
Change-Id: I49c735b7e848f1c2f04d0dbe85b9a93e29a01c52
- Added tests for unary and binary fusions with zendnn_linear
- Added tests for the export path
- Removed unnecessary comments from test_zendnn_linear.py and
  test_zendnn_linear_fusions.py

Change-Id: Ide4e649a8fc510ea567f07b45823cb21e921fa6d
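
For flavor, a hedged sketch of what a fusion accuracy check like the tests described above might look like; the op namespace and signature are assumptions, and this is not the PR's actual test code.

```python
import torch

# Hedged sketch of a fusion accuracy check; the op namespace and signature
# are assumptions, with plain linear + relu used as the reference.
x = torch.randn(4, 16)
w = torch.randn(8, 16)   # (out_features, in_features)
b = torch.randn(8)

ref = torch.relu(torch.nn.functional.linear(x, w, b))
out = torch.relu(torch.ops.aten.zendnn_linear(x, w, b))  # assumed signature
torch.testing.assert_close(out, ref, rtol=1e-4, atol=1e-4)
```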
@naveenthangudu force-pushed the gh/naveenthangudu/zendnn-ph1-pr1 branch from 0ef40c6 to 7de90a2 on August 12, 2025 at 16:58
@amukho (Contributor) commented Aug 12, 2025

Hi @malfet @desertfire @jithunnair-amd, can you please add the ciflow/inductor-perf-test-nightly-x86-zen label to this PR to enable triggering the inductor perf dashboard upload with the changes in this PR?

Labels
  • module: cpu (CPU specific problem, e.g., perf, algorithm)
  • module: dynamo
  • module: inductor
  • open source
  • release notes: inductor (aoti)
  • release notes: releng (release notes category)
  • triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)