
[triton_heuristics] Optimize the triton launcher in pt2 #158897


Open
wants to merge 2 commits into base: main

Conversation

Contributor

@xuzhao9 xuzhao9 commented Jul 23, 2025

Summary:
We observed a ~10us PT2-Triton launch overhead regression after the Triton pin update.

Before Triton pin-update:
{F1980557238}

After Triton pin-update:
{F1980557240}

The root cause is that #145051 added `_get_args_with_constexprs` to the cubin launcher caller function, which is on the critical path.

Note that static_cuda_launcher.py does not require constants to be passed to the cubin launcher (https://www.internalfb.com/code/fbsource/[2f2a11827c50e03a21b75ddd34771f34667fba98]/fbcode/caffe2/torch/_inductor/runtime/static_cuda_launcher.py?lines=215), so there is no need to pass constexprs to the generated launcher code.

Analysis: https://docs.google.com/document/d/1PHaSmx2w59K8qpjw5_qzKWShfEgptf_Zpv_DL7YxiWU/edit?tab=t.0
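
For illustration only, a minimal sketch of the idea, with hypothetical names that do not correspond to the actual torch/_inductor code: do the constexpr filtering once when the launcher is generated, so the per-launch hot path no longer calls anything like `_get_args_with_constexprs`.

```python
# Hypothetical sketch (not the actual Inductor codegen): precompute which
# positional args are constexprs, so each launch only does an index gather.
from typing import Any, Callable, Sequence


def make_cubin_launcher(
    arg_names: Sequence[str],
    constexpr_names: frozenset[str],
    run_cubin: Callable[..., Any],
) -> Callable[..., Any]:
    # One-time work at launcher-generation time, off the critical path.
    keep = [i for i, name in enumerate(arg_names) if name not in constexpr_names]

    def launcher(*args: Any) -> Any:
        # Critical path: no per-call constexpr bookkeeping, just forward the
        # runtime (non-constexpr) arguments to the cubin launcher.
        return run_cubin(*(args[i] for i in keep))

    return launcher
```

The sketch only captures the intent behind the change: the static cubin launcher never reads constexpr values, so any constexpr handling can be hoisted out of the launch hot path.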

Test Plan:
Before:
```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs

1.863x
```

After:
```
$ buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
  x_val    nop_python_function-walltime    nop_triton_kernel-walltime    nop_triton_compiled_kernel_run-walltime    nop_inductor_kernel-walltime    nop_inductor_kernel_cudagraph-walltime
-------  ------------------------------  ----------------------------  -----------------------------------------  ------------------------------  ----------------------------------------
      0                      0.00747067                       1.92589                                   0.726509                         4.35459                                  0.204205
     19                      0.00747823                       7.36852                                   1.26241                          6.28208                                  0.239278
average                      0.00747445                       4.6472                                    0.994459                         5.31834                                  0.221741
```
```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs

1.985x
```

Rollback Plan:

Differential Revision: D78783302

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben


pytorch-bot bot commented Jul 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158897

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 1 Unrelated Failure

As of commit 195e7da with merge base 4c01991:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

  • pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)
    /var/lib/jenkins/workspace/xla/torch_xla/csrc/runtime/BUILD:476:14: Compiling torch_xla/csrc/runtime/xla_util_test.cpp failed: (Exit 1): gcc failed: error executing CppCompile command (from target //torch_xla/csrc/runtime:xla_util_test) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 229 arguments skipped)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D78783302

Contributor

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`.

If not, please add the `topic: not user facing` label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@xuzhao9 xuzhao9 force-pushed the export-D78783302 branch from b0242f3 to 8dd47f3 Compare July 23, 2025 06:00
xuzhao9 added a commit to xuzhao9/pytorch that referenced this pull request Jul 23, 2025
Summary:

We observed a ~10us PT2-Triton launch overhead regression after the pin update.

Before Triton pin-update:
 {F1980557238} 

After Triton pin-update:
 {F1980557240} 


The root cause is that pytorch#145051 added `_get_args_with_constexprs` to the cubin launcher caller function, which is on the critical path.

Note that static_cuda_launcher.py does not require constants to be passed to the cubin launcher (https://www.internalfb.com/code/fbsource/[2f2a11827c50e03a21b75ddd34771f34667fba98]/fbcode/caffe2/torch/_inductor/runtime/static_cuda_launcher.py?lines=215), so there is no need to pass constexprs to the generated launcher code.

I am pretty sure this will work for StaticallyLaunchedCudaKernel, but more investigation is needed into whether it will work for triton.compile.CompiledKernel.

Analysis: https://docs.google.com/document/d/1PHaSmx2w59K8qpjw5_qzKWShfEgptf_Zpv_DL7YxiWU/edit?tab=t.0

Test Plan:
Before:
```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs 

1.893x
```


```

$ buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
  x_val    nop_python_function-walltime    nop_triton_kernel-walltime    nop_triton_compiled_kernel_run-walltime    nop_inductor_kernel-walltime    nop_inductor_kernel_cudagraph-walltime
-------  ------------------------------  ----------------------------  -----------------------------------------  ------------------------------  ----------------------------------------
      0                      0.00760921                       1.80298                                   0.623282                         5.25024                                  0.203722
     19                      0.00799885                       4.78223                                   1.00226                          5.8213                                   0.239084
average                      0.00780403                       3.29261                                   0.812769                         5.53577                                  0.221403
```


After:

```
buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
  x_val    nop_python_function-walltime    nop_triton_kernel-walltime    nop_triton_compiled_kernel_run-walltime    nop_inductor_kernel-walltime    nop_inductor_kernel_cudagraph-walltime
-------  ------------------------------  ----------------------------  -----------------------------------------  ------------------------------  ----------------------------------------
      0                      0.00747067                       1.92589                                   0.726509                         4.35459                                  0.204205
     19                      0.00747823                       7.36852                                   1.26241                          6.28208                                  0.239278
average                      0.00747445                       4.6472                                    0.994459                         5.31834                                  0.221741
```


```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs 

1.985x
```

Rollback Plan:

Differential Revision: D78783302
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D78783302

xuzhao9 added a commit to xuzhao9/pytorch that referenced this pull request Jul 23, 2025
@xuzhao9 xuzhao9 force-pushed the export-D78783302 branch from 8dd47f3 to 53c2cf0 Compare July 23, 2025 06:01
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D78783302

@xuzhao9 xuzhao9 force-pushed the export-D78783302 branch from 53c2cf0 to 7c4442a Compare July 25, 2025 19:34
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D78783302

pytorch-bot bot pushed a commit that referenced this pull request Jul 25, 2025
Summary:

We observed a ~10us PT2-Triton launch overhead regression after the pin update.

Before Triton pin-update:
 {F1980557238} 

After Triton pin-update:
 {F1980557240} 


The root cause is that #145051 added `_get_args_with_constexprs` to the cubin launcher caller function, which is on the critical path.

Note that static_cuda_launcher.py does not require constants to be passed to the cubin launcher (https://www.internalfb.com/code/fbsource/[2f2a11827c50e03a21b75ddd34771f34667fba98]/fbcode/caffe2/torch/_inductor/runtime/static_cuda_launcher.py?lines=215), so there is no need to pass constexprs to the generated launcher code.

The new launcher code needs to work on three cases:
- StaticallyLaunchedCudaKernel
- triton.compile.CompiledKernel
- [WIP] AOTInductor

Analysis: https://docs.google.com/document/d/1PHaSmx2w59K8qpjw5_qzKWShfEgptf_Zpv_DL7YxiWU/edit?tab=t.0
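
(Illustrative only, not part of the commit: a minimal sketch, with hypothetical attribute and parameter names, of the branching the generated launcher code has to get right across the three cases above.)

```python
# Hypothetical sketch; attribute/parameter names are illustrative and argument
# order is elided, since the real signatures differ across the three cases.
def launch(kernel, grid, stream, full_args, nonconstexpr_args):
    if getattr(kernel, "is_statically_launched", False):
        # StaticallyLaunchedCudaKernel: the cubin launcher never reads
        # constexpr values, so only the runtime args need to be passed.
        return kernel.run(grid, stream, *nonconstexpr_args)
    # triton.compile.CompiledKernel (and, eventually, AOTInductor) may still
    # expect the full argument list, constexprs included, so pass everything.
    return kernel.run(grid, stream, *full_args)
```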

Test Plan:
Before:
```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs 

1.893x
```


```

$ buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
  x_val    nop_python_function-walltime    nop_triton_kernel-walltime    nop_triton_compiled_kernel_run-walltime    nop_inductor_kernel-walltime    nop_inductor_kernel_cudagraph-walltime
-------  ------------------------------  ----------------------------  -----------------------------------------  ------------------------------  ----------------------------------------
      0                      0.00760921                       1.80298                                   0.623282                         5.25024                                  0.203722
     19                      0.00799885                       4.78223                                   1.00226                          5.8213                                   0.239084
average                      0.00780403                       3.29261                                   0.812769                         5.53577                                  0.221403
```


After:

```
buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
  x_val    nop_python_function-walltime    nop_triton_kernel-walltime    nop_triton_compiled_kernel_run-walltime    nop_inductor_kernel-walltime    nop_inductor_kernel_cudagraph-walltime
-------  ------------------------------  ----------------------------  -----------------------------------------  ------------------------------  ----------------------------------------
      0                      0.00747067                       1.92589                                   0.726509                         4.35459                                  0.204205
     19                      0.00747823                       7.36852                                   1.26241                          6.28208                                  0.239278
average                      0.00747445                       4.6472                                    0.994459                         5.31834                                  0.221741
```


```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs 

1.985x
```

Rollback Plan:

Differential Revision: D78783302
@xuzhao9 xuzhao9 force-pushed the export-D78783302 branch from 7c4442a to 456f84d Compare July 25, 2025 22:19
xuzhao9 added a commit to xuzhao9/pytorch that referenced this pull request Jul 25, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D78783302

davidberard98 pushed a commit to davidberard98/pytorch that referenced this pull request Aug 6, 2025
xuzhao9 added 2 commits August 6, 2025 15:51
Differential Revision: D78789289