Fuse matmul #157743


Open · wants to merge 31 commits into main from fuse_matmul

Conversation

nullplay (Collaborator) commented Jul 7, 2025

Implementation of #151705

This PR introduces the initial implementation of native tl.dot support in Inductor, with the goal of generating Triton matmul kernels directly—without relying on predefined templates.

To avoid complexity and ease the review process, I plan to split this work into two phases as outlined in #151705:

  1. Basic support (this PR)
  2. Lazy broadcasting for optimal performance (future PR)

Summary of This PR

This PR implements the basic functionality. It does not include lazy broadcasting, so the generated kernels may involve explicit tl.reshape and tl.trans operations before calling tl.dot, which introduce some overhead.

Notable Changes

  1. Adds a new config flag: config.triton.enable_native_matmul (see the usage sketch after this list)
  2. Introduces a new ops.dot IR node in Inductor and lowers aten.mm and aten.bmm to it when native matmul is enabled
  3. Enforces tiling suitable for matmul when the native matmul flag is enabled
  4. Implements code generation for ops.dot
  5. Adds Triton autotuning heuristics: for now, I’ve copied the configuration from the existing matmul templates. However, this may not be optimal—it currently takes a long time to tune, and I think there must be a better way to tackle this.
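
For reviewers who want to try this out, here is a minimal usage sketch. It assumes the flag is exposed as torch._inductor.config.triton.enable_native_matmul, matching the config diff below; nothing here is a released PyTorch API.

import torch
import torch._inductor.config as inductor_config

# Assumption: this PR's flag; not available in released PyTorch.
inductor_config.triton.enable_native_matmul = True

@torch.compile
def mm(a, b):
    # With the flag on, aten.mm lowers to the new ops.dot IR node and
    # Inductor emits a Triton kernel calling tl.dot directly, instead of
    # going through the predefined matmul templates.
    return a @ b

a = torch.randn(512, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 512, device="cuda", dtype=torch.float16)
out = mm(a, b)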

@eellison @jansel @PaulZhang12 @shunting314

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @mlazos


pytorch-bot bot commented Jul 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157743

Note: Links to docs will display an error until the docs builds have been completed.

❌ 14 New Failures

As of commit 68cfd8e with merge base ca7315c:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Jul 7, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@@ -1197,6 +1197,9 @@ class triton:
# For best results, this should be used with prefer_nd_tiling.
tile_reductions: bool = False

# Codegen matmul natively with tl.dot without calling template.
enable_native_matmul: bool = False
Contributor

Can you update the PR to turn this on so we can do a full CI run with it enabled to check for bugs?

(After CI is passing we can turn it off again)

Collaborator Author

I’ve just enabled it, but I’m not fully confident about the performance due to the potential overhead from the reshape and transpose operations. Back in March, the Triton compiler didn’t handle these operations efficiently, which resulted in slower performance. To work around this, I had to modify Inductor to emit alternative code, which I had originally planned to include in a follow-up PR.

# Tiles of the two matmul inputs, loaded with Inductor's usual pointwise indexing.
tmp0 = tl.load(in_ptr0 + (r0_2 + 128 * y0), r0_mask & ymask, eviction_policy='evict_last', other=0.0)
tmp1 = tl.load(in_ptr1 + (x1 + 128 * r0_2), r0_mask & xmask, eviction_policy='evict_last', other=0.0)
# Explicit reshape/transpose into 2D blocks before tl.dot -- the overhead mentioned above.
tmp2 = tl.dot(tl.reshape(tmp0, [YBLOCK, R0_BLOCK]), tl.trans(tl.reshape(tmp1, [XBLOCK, R0_BLOCK])), allow_tf32=False)

jansel (Contributor) commented Jul 8, 2025

I haven't looked at this super carefully yet, but I kicked off a benchmark run with it enabled here:
https://github.com/pytorch/pytorch/actions/runs/16134785066

It should show up in the dropdown (nullplay_fuse_matmul) here once the job finishes:
https://hud.pytorch.org/benchmark/compilers

jerryzh168 added the triaged label on Jul 8, 2025
nullplay (Collaborator Author) commented:

I noticed that when doing torch.float16 matmuls, it was automatically upcasting to float32. Disabling config.triton.codegen_upcast_to_fp32 made things faster. I'm not sure what effect this might have on other parts of the code, but I’ve set config.triton.codegen_upcast_to_fp32 = False for now.
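
For anyone who wants to reproduce the fp16 comparison, a minimal sketch follows; it assumes this PR's enable_native_matmul flag and uses triton.testing.do_bench for timing, so treat it as illustrative rather than the exact benchmark setup.

import torch
import torch._inductor.config as inductor_config
from triton.testing import do_bench

# Both settings discussed in this thread; enable_native_matmul is from this PR.
inductor_config.triton.enable_native_matmul = True
inductor_config.triton.codegen_upcast_to_fp32 = False  # keep fp16 operands for tl.dot

a = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)

compiled_mm = torch.compile(torch.mm)
compiled_mm(a, b)  # trigger compilation once before timing

ms = do_bench(lambda: compiled_mm(a, b))
print(f"fp16 mm via native tl.dot: {ms:.3f} ms")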

I fixed a few bugs and pushed the changes again. Could you re-run the CI and performance benchmarks?

Just to confirm—there’s no way for me to trigger the CI myself, right? Or is there a way to run the tests locally on my end?


pytorch-bot bot commented Jul 12, 2025

To add the ciflow label ciflow/inductor please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

jansel (Contributor) commented Jul 12, 2025

> I fixed a few bugs and pushed the changes again. Could you re-run the CI and performance benchmarks?

Something is odd with CI (in this PR and a few others). I don't see any jobs to approve.

There is also a merge conflict. Can you rebase? That will hopefully fix the CI issue.

> I noticed that when doing torch.float16 matmuls, it was automatically upcasting to float32. Disabling config.triton.codegen_upcast_to_fp32 made things faster. I'm not sure what effect this might have on other parts of the code, but I’ve set config.triton.codegen_upcast_to_fp32 = False for now.

This is to match what eager pytorch does for pointwise ops. Most of those ops are memory bound so the upcast to fp32 doesn't matter for performance. For matmuls that won't work. We should modify the upcast logic to not apply to matmuls.
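
Purely to illustrate that suggestion, a hypothetical sketch of a per-op upcast decision (the names here are made up; the real Inductor codegen hooks are different):

# Hypothetical sketch only -- not the actual Inductor code path.
def should_upcast_to_fp32(op_name: str, upcast_enabled: bool) -> bool:
    # Match eager for memory-bound pointwise ops, where the fp32 upcast is
    # essentially free, but keep fp16/bf16 operands for ops feeding tl.dot,
    # where upcasting costs real matmul throughput.
    if not upcast_enabled:
        return False
    return op_name != "dot"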

> Just to confirm—there’s no way for me to trigger the CI myself, right? Or is there a way to run the tests locally on my end?

I just asked to add permissions for you to trigger CI yourself.

You should be able to run tests locally. Failing tests should print out the repro command, and the benchmarks are all in the pytorch/benchmarks folder.

jansel (Contributor) commented Jul 12, 2025

You should have access to start CI now. I kicked off another benchmark run here: https://github.com/pytorch/pytorch/actions/runs/16242184585

nullplay force-pushed the fuse_matmul branch 2 times, most recently from cfed28d to 1793aec on August 1, 2025
pytorch-bot added the oncall: distributed label on Aug 7, 2025
Labels: ciflow/inductor · module: inductor · oncall: distributed · open source · release notes: inductor · triaged