
[inductor] initial triton static config lookup table #156785


Open · wants to merge 1 commit into base: main

Conversation


@coconutruben (Contributor) commented Jun 25, 2025

Summary:

Why

  • enable initial feature set to see wider internal benchmarking and adoption
  • introduce expected behavior testing that subsequent expansions (more backends, functions, etc) can rely on

What

  • First version of a static lookup table for Triton configs across mm, addmm, bmm, mm_plus_mm
  • supports triton, tma, decompose_k, and bias_addmm in configs
  • configuration inside lookup_table.py for now, with knob to turn on/off in inductor_config.triton
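As a rough illustration of the idea (the schema, key fields, and helper below are hypothetical, not the PR's actual format in lookup_table.py), a static lookup table maps a GEMM op and problem description to one pre-tuned config:

```python
# Hypothetical sketch of a static GEMM config lookup table; the real
# schema in lookup_table.py may differ. Each key identifies an op and
# problem shape; each value is a single pre-tuned config.
LOOKUP_TABLE = {
    # (op_name, m, n, k, dtype)
    ("mm", 1024, 1024, 1024, "float16"): {
        "template": "triton",
        "BLOCK_M": 64,
        "BLOCK_N": 64,
        "BLOCK_K": 32,
        "num_stages": 4,
        "num_warps": 8,
    },
    ("addmm", 2048, 2048, 1024, "bfloat16"): {
        "template": "bias_addmm",
    },
}


def lookup(op_name, m, n, k, dtype):
    """Return the pre-tuned config for this problem, or None on a miss."""
    return LOOKUP_TABLE.get((op_name, m, n, k, dtype))
```

On a miss the caller would fall back to normal autotuning (or aten), so the table only ever short-circuits problems it has seen before.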

Test Plan:

```
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:template_heuristics_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_e2e
```

Rollback Plan:

Differential Revision: D76945358

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

pytorch-bot (bot) commented Jun 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156785

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 2 New Failures, 2 Unrelated Failures

As of commit 32deb56 with merge base 6215e90:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D76945358

@jansel (Contributor) left a comment:

Failing tests and lints

Comment on lines 253 to 255
```python
if inductor_config.triton.use_gemm_config_lookup_table:
    if len(choices) > 1:
        choices = [choices[-1]]
```
Contributor:

Why is this needed? Does this mean the lookup table returned multiple values? How does this happen?

Contributor Author:

This is a way to get "fallback to aten if there is no hit in the table" behavior.
You're right that the lookup table returns at most a single config. We always also add the aten option. So if we're using the lookup table and there are multiple choices, that means there was a hit, and we skip aten.
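That "more than one choice means a table hit" logic can be sketched roughly as follows (function and variable names are hypothetical):

```python
def apply_lookup_table_fallback(choices, use_lookup_table):
    """Sketch of the described fallback behavior (hypothetical names).

    `choices` always starts with the aten choice; a lookup-table hit
    appends exactly one Triton config after it. So when the table is
    active and there is more than one choice, keep only the hit and
    skip aten; otherwise fall back to aten.
    """
    if use_lookup_table and len(choices) > 1:
        # Table hit: drop aten, keep the single looked-up config.
        return [choices[-1]]
    # Table miss (or table disabled): choices may be just [aten].
    return choices
```

With the table enabled, a miss leaves only aten in the list, which gives the "fallback to aten if there is no hit" behavior without any extra branching at the call site.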

Contributor:

Let's refactor this to be more clear.

Contributor Author:

Are you good with the way it's inside lookup_table.py now, abstracted out?

```python
for config in mm_configs(
    m,
    n,
    k,
    **mm_config_kwargs(device_type, _is_large_block_for_cpu, dtype.itemsize),
    valid=valid_config,
```
Contributor:

Using config filtering here may be error-prone. If we change the config space then you could get zero configs. Can we look up the correct config from the table?

Contributor Author:

This is just poorly worded; I can rephrase. The valid_config is used to parse the config out of the table entry (we don't filter against it). If we're using the lookup table and valid is supplied, there will be either no mm_configs (valid is empty, a parsing error, etc.) or a single config, the one defined in valid.
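A minimal sketch of that contract, with hypothetical names (the real parsing lives in lookup_table.py and may differ): when `valid` is supplied, the generator yields either the single config it encodes, or nothing at all.

```python
def mm_configs_from_valid(valid):
    """Yield exactly the config encoded in `valid`, or nothing if it
    cannot be parsed. No filtering of the general config space happens
    here. (Hypothetical sketch of the described behavior.)"""
    if not valid:
        return  # empty / missing entry: no configs, caller falls back
    try:
        yield {
            "BLOCK_M": int(valid["BLOCK_M"]),
            "BLOCK_N": int(valid["BLOCK_N"]),
            "BLOCK_K": int(valid["BLOCK_K"]),
        }
    except (KeyError, TypeError, ValueError):
        return  # parsing error: behave like a miss
```

So an empty or malformed entry degrades to a miss rather than shrinking the config space to zero.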

```diff
@@ -696,14 +699,18 @@ def tuned_mm(mat1, mat2, *, layout=None):
     mm_configs = V.choices.get_base_mm_configs(device_type)
     persistent_mm_configs = V.choices.get_persistent_mm_configs(device_type)
     extra_mm_configs = V.choices.get_extra_mm_configs(device_type)
+    lookup_dict = get_lookup_table([mat1, mat2], name)
```
Contributor:

Do we need to take into account the epilogue in the lookup table?

Contributor Author:

IIUC, no, if you mean the template epilogue (e.g. for addmm): the table is function-specific right now, so a config for addmm is picked specifically for addmm and is found in the lookup table under the addmm key.

Contributor:

But doesn't the autotuning measure using the epilogue? I recall @eellison added that. This can affect performance, sometimes a lot (the epilogue can cause register spills).

Contributor Author:

IIUC what @eellison explained to me, this should be safe because it happens in two stages. We first find the best config (here), and then we benchmark that best config + epilogue against not fusing the epilogue. So at this stage we're not interfering with that decision. We don't do an NxN comparison of configs + epilogue vs. configs.

Contributor:

@jansel It depends on how we construct the lookup table. If we take the autotune logs for example, then it is not fusion aware. However, if we look at whether the scheduler does a fusion with a certain config in a previous max-autotune run, we can use that config in the lookup table.

The lookup table just uses whichever config(s) we give it. There is no benchmarking of epilogues by default as that can lead to significant compile time overhead.

"""
Load and return the gemm config lookup table from file if configured.
"""
global gemm_config_lookup_table
Contributor:

functools.cache()?

Contributor Author:

We want to support both providing a path and defining the table raw in lookup_table.py by writing a dict to it. With @cache, I would lose that ability, I think?

Contributor:

I think there is a .cache_clear()

Contributor Author:

Implemented an lru_cache now for the file reading itself (path-dependent), and gemm_config_lookup_table can still be set to override the file reading. This covers the current use cases, prints a warning when both are set, and should work for now, but lmk if you see an issue there still.



coconutruben added a commit to coconutruben/pytorch that referenced this pull request Jun 25, 2025
Summary:
Pull Request resolved: pytorch#156785

# Why

- enable initial feature set to see wider internal benchmarking and adoption
- introduce expected behavior testing that subsequent expansions (more backends, functions, etc) can rely on

# What

- First version of a static lookup table for Triton configs across mm, addmm, bmm, mm_plus_mm
- supports triton, tma, decompose_k, and bias_addmm in configs
- configuration inside lookup_table.py for now, with knob to turn on/off in inductor_config.triton

Test Plan:
```
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:template_heuristics_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_e2e
```

Rollback Plan:

Differential Revision: D76945358



@etaf added the ciflow/xpu (Run XPU CI tasks) label Jun 26, 2025
@etaf (Collaborator) commented Jun 26, 2025

I added ciflow/xpu to test whether this feature would break XPU. I apologize for any inconvenience this may have caused.


@jansel (Contributor) commented Jun 27, 2025

Are we including tf32 enablement and device info in the lookup table key?

@coconutruben (Contributor Author):

> Are we including tf32 enablement and device info in the lookup table key?

We're not, for this version. In this version that should work, gated by keeping the same config.py settings for the runtime and for lookup table generation.

Next version, we'll definitely have this.

@jansel (Contributor) commented Jun 27, 2025

Won't this affect the file format? How will we maintain BC with the current format when we make this change?

@PaulZhang12 (Contributor):

@jansel what is the issue with tf32 enablement here?

@jansel (Contributor) commented Jun 30, 2025

The performance with and without tf32 is totally different (it gets mapped to different hardware units). So if you autotune with tf32 enabled (a global flag), that config should not be used when it is disabled.
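The concern can be illustrated with a sketch of a lookup key that folds in the tf32 flag and device identity (hypothetical helper; the v1 key in this PR does not include these fields). In PyTorch the global flag lives at `torch.backends.cuda.matmul.allow_tf32`; here it is passed in explicitly to keep the sketch self-contained:

```python
def gemm_lookup_key(op_name, m, n, k, dtype, allow_tf32, device_name):
    """Build a lookup key that distinguishes tf32-on from tf32-off runs
    and different GPU models, so a config autotuned under one setting is
    never reused under another. (Hypothetical sketch; callers would pass
    torch.backends.cuda.matmul.allow_tf32 and the CUDA device name.)"""
    return (op_name, m, n, k, dtype, bool(allow_tf32), device_name)
```

A config tuned with tf32 enabled then simply misses the table when tf32 is disabled, falling back to normal autotuning instead of reusing a mismatched config.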

@coconutruben (Contributor Author):

> Won't this affect the file format? How will we maintain BC with the current format when we make this change?

It will affect the format; the v1 format is a subset of the v3 (full) format. There are two pathways:

  • if v1 ends up being adopted by internal users, we will provide a bridge mechanism and translate the lookup tables over. That should be a one and done implementation to translate the tables.
  • if v1 ends up just being a proof of concept, we'll just deprecate that lookup table format entirely
