
[inductor] initial triton static config lookup table #156785


Open · wants to merge 1 commit into base: main

Conversation


@coconutruben (Contributor) commented Jun 25, 2025

Summary:

Why

  • enable initial feature set to see wider internal benchmarking and adoption
  • introduce expected behavior testing that subsequent expansions (more backends, functions, etc) can rely on

What

  • First version of a static lookup table for Triton configs across mm, addmm, bmm, mm_plus_mm
  • supports triton, tma, decompose_k, and bias_addmm in configs
  • configuration inside lookup_table.py for now, with knob to turn on/off in inductor_config.triton
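As a rough illustration of the idea (the schema, key fields, and helper below are hypothetical, not the PR's actual format in lookup_table.py), a static lookup table maps a GEMM op and problem description to one pre-tuned config:

```python
# Hypothetical sketch of a static GEMM config lookup table; the real
# schema in lookup_table.py may differ. Each key identifies an op and
# problem shape; each value is a single pre-tuned config.
LOOKUP_TABLE = {
    # (op_name, m, n, k, dtype)
    ("mm", 1024, 1024, 1024, "float16"): {
        "template": "triton",
        "BLOCK_M": 64,
        "BLOCK_N": 64,
        "BLOCK_K": 32,
        "num_stages": 4,
        "num_warps": 8,
    },
    ("addmm", 2048, 2048, 1024, "bfloat16"): {
        "template": "bias_addmm",
    },
}


def lookup(op_name, m, n, k, dtype):
    """Return the pre-tuned config for this problem, or None on a miss."""
    return LOOKUP_TABLE.get((op_name, m, n, k, dtype))
```

On a miss the caller would fall back to normal autotuning (or aten), so the table only ever short-circuits problems it has seen before.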

Test Plan:

```
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:template_heuristics_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_e2e
```

Rollback Plan:

Differential Revision: D76945358

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

pytorch-bot (bot) commented Jun 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156785

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 2 New Failures, 2 Unrelated Failures

As of commit 32deb56 with merge base 6215e90:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D76945358

@jansel (Contributor) left a comment:

Failing tests and lints

Comment on lines 253 to 255
```python
if inductor_config.triton.use_gemm_config_lookup_table:
    if len(choices) > 1:
        choices = [choices[-1]]
```
Contributor:

Why is this needed? Does this mean the lookup table returned multiple values? How does this happen?

Contributor Author:

This is a way to get "fallback to aten if there is no hit in the table" behavior.
You're right that the lookup table returns at most a single config. We always also add the aten option. So if we're using the lookup table and there are multiple choices, that means there was a hit, and we skip aten.
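That "more than one choice means a table hit" logic can be sketched roughly as follows (function and variable names are hypothetical):

```python
def apply_lookup_table_fallback(choices, use_lookup_table):
    """Sketch of the described fallback behavior (hypothetical names).

    `choices` always starts with the aten choice; a lookup-table hit
    appends exactly one Triton config after it. So when the table is
    active and there is more than one choice, keep only the hit and
    skip aten; otherwise fall back to aten.
    """
    if use_lookup_table and len(choices) > 1:
        # Table hit: drop aten, keep the single looked-up config.
        return [choices[-1]]
    # Table miss (or table disabled): choices may be just [aten].
    return choices
```

With the table enabled, a miss leaves only aten in the list, which gives the "fallback to aten if there is no hit" behavior without any extra branching at the call site.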

Contributor:

Let's refactor this to be more clear.

Contributor Author:

Are you good with the way it's inside lookup_table.py now, abstracted out?

```python
for config in mm_configs(
    m,
    n,
    k,
    **mm_config_kwargs(device_type, _is_large_block_for_cpu, dtype.itemsize),
    valid=valid_config,
```
Contributor:

Using config filtering here may be error-prone. If we change the config space then you could get zero configs. Can we look up the correct config from the table?

Contributor Author:

This is just poorly worded; I can rephrase. The valid_config is used to parse the config out of the table entry (we don't filter against it). If we're using the lookup table and valid is supplied, there will be either no mm_configs (valid is empty, a parsing error, etc.) or a single config, the one defined in valid.
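A minimal sketch of that contract, with hypothetical names (the real parsing lives in lookup_table.py and may differ): when `valid` is supplied, the generator yields either the single config it encodes, or nothing at all.

```python
def mm_configs_from_valid(valid):
    """Yield exactly the config encoded in `valid`, or nothing if it
    cannot be parsed. No filtering of the general config space happens
    here. (Hypothetical sketch of the described behavior.)"""
    if not valid:
        return  # empty / missing entry: no configs, caller falls back
    try:
        yield {
            "BLOCK_M": int(valid["BLOCK_M"]),
            "BLOCK_N": int(valid["BLOCK_N"]),
            "BLOCK_K": int(valid["BLOCK_K"]),
        }
    except (KeyError, TypeError, ValueError):
        return  # parsing error: behave like a miss
```

So an empty or malformed entry degrades to a miss rather than shrinking the config space to zero.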

```diff
@@ -696,14 +699,18 @@ def tuned_mm(mat1, mat2, *, layout=None):
     mm_configs = V.choices.get_base_mm_configs(device_type)
     persistent_mm_configs = V.choices.get_persistent_mm_configs(device_type)
     extra_mm_configs = V.choices.get_extra_mm_configs(device_type)
+    lookup_dict = get_lookup_table([mat1, mat2], name)
```
Contributor:

Do we need to take into account the epilogue in the lookup table?

Contributor Author:

IIUC, no, if you mean the template epilogue (e.g. for addmm): the table is function-specific right now, so a config for addmm is picked specifically for addmm and is found in the lookup table under the addmm key.

Contributor:

But doesn't the autotuning measure using the epilogue? I recall @eellison added that. This can affect performance, sometimes a lot (the epilogue can cause register spills).

Contributor Author:

IIUC what @eellison explained to me, this should be safe because it happens in two stages. We first find the best config (here), and then we benchmark that best config + epilogue against not fusing the epilogue. So at this stage we're not interfering with that decision. We don't do an NxN comparison of configs + epilogue vs. configs.

Contributor:

@jansel It depends on how we construct the lookup table. If we take the autotune logs for example, then it is not fusion aware. However, if we look at whether the scheduler does a fusion with a certain config in a previous max-autotune run, we can use that config in the lookup table.

The lookup table just uses whichever config(s) we give it. There is no benchmarking of epilogues by default as that can lead to significant compile time overhead.

"""
Load and return the gemm config lookup table from file if configured.
"""
global gemm_config_lookup_table
Contributor:

functools.cache()?

Contributor Author:

We want to support both providing a path and defining the table raw in lookup_table.py by writing a dict to it. With @cache, I would lose that ability, I think?

Contributor:

I think there is a .cache_clear()

Contributor Author:

Implemented an lru_cache now for the file reading itself (path-dependent), and gemm_config_lookup_table can still be set to override the file reading. This covers the current use cases, prints a warning when both are set, and should work for now, but lmk if you see an issue there still.



coconutruben added a commit to coconutruben/pytorch that referenced this pull request Jun 25, 2025
Summary:
Pull Request resolved: pytorch#156785

# Why

- enable initial feature set to see wider internal benchmarking and adoption
- introduce expected behavior testing that subsequent expansions (more backends, functions, etc) can rely on

# What

- First version of a static lookup table for Triton configs across mm, addmm, bmm, mm_plus_mm
- supports triton, tma, decompose_k, and bias_addmm in configs
- configuration inside lookup_table.py for now, with knob to turn on/off in inductor_config.triton

Test Plan:
```
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:template_heuristics_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_e2e
```

Rollback Plan:

Differential Revision: D76945358



@etaf added the ciflow/xpu (Run XPU CI tasks) label Jun 26, 2025
@etaf (Collaborator) commented Jun 26, 2025

I added ciflow/xpu to test whether this feature would break XPU. I apologize for any inconvenience this may have caused.


@jansel (Contributor) commented Jun 27, 2025

Are we including tf32 enablement and device info in the lookup table key?

@coconutruben (Contributor Author):

> Are we including tf32 enablement and device info in the lookup table key?

We're not, for this version. In this version that should work, gated by keeping the same config.py settings for the runtime and for lookup table generation.

Next version, we'll definitely have this.

@jansel (Contributor) commented Jun 27, 2025

Won't this affect the file format? How will we maintain BC with the current format when we make this change?

@PaulZhang12 (Contributor):

@jansel what is the issue with tf32 enablement here?

@jansel (Contributor) commented Jun 30, 2025

The performance with and without tf32 is totally different (it gets mapped to different hardware units). So if you autotune with tf32 enabled (a global flag), that config should not be used when it is disabled.
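The concern can be illustrated with a sketch of a lookup key that folds in the tf32 flag and device identity (hypothetical helper; the v1 key in this PR does not include these fields). In PyTorch the global flag lives at `torch.backends.cuda.matmul.allow_tf32`; here it is passed in explicitly to keep the sketch self-contained:

```python
def gemm_lookup_key(op_name, m, n, k, dtype, allow_tf32, device_name):
    """Build a lookup key that distinguishes tf32-on from tf32-off runs
    and different GPU models, so a config autotuned under one setting is
    never reused under another. (Hypothetical sketch; callers would pass
    torch.backends.cuda.matmul.allow_tf32 and the CUDA device name.)"""
    return (op_name, m, n, k, dtype, bool(allow_tf32), device_name)
```

A config tuned with tf32 enabled then simply misses the table when tf32 is disabled, falling back to normal autotuning instead of reusing a mismatched config.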

@coconutruben (Contributor Author):

> Won't this affect the file format? How will we maintain BC with the current format when we make this change?

It will affect the format; the v1 format is a subset of the v3 (full) format. There are two pathways:

  • if v1 ends up being adopted by internal users, we will provide a bridge mechanism and translate the lookup tables over. That should be a one and done implementation to translate the tables.
  • if v1 ends up just being a proof of concept, we'll just deprecate that lookup table format entirely
