
[inductor] initial triton static config lookup table #157699


Open · wants to merge 35 commits into base: gh/coconutruben/20/base

Conversation

@coconutruben (Contributor) commented Jul 7, 2025

Stack from ghstack (oldest at bottom):

Summary:

Why

  • Enable the initial feature set to get wider internal benchmarking and adoption
  • Introduce expected-behavior tests that subsequent expansions (more backends, functions, etc.) can rely on

What

  • First version of a static lookup table for Triton configs across mm, addmm, bmm, and mm_plus_mm
  • Supports triton, tma, decompose_k, and bias_addmm in configs
  • Configuration lives inside lookup_table.py for now, with a knob to turn it on/off in inductor_config.triton
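As a rough illustration of the idea (this is not the actual inductor data layout; `LOOKUP_TABLE`, `lookup_configs`, and the config fields below are hypothetical), a static table like this maps an input key to pinned Triton configs and falls back to the normal config heuristics on a miss:

```python
# Hypothetical sketch of a static Triton-config lookup table for mm-family ops.
# All names here are illustrative stand-ins, not the real inductor API.
from typing import Optional

# Table maps an input key -> list of template configs to use instead of the
# default autotuning search space.
LOOKUP_TABLE: dict[str, list[dict]] = {
    "NVIDIA H100+mm+((torch.bfloat16, [1024, 1024], [1024, 1]), "
    "(torch.bfloat16, [1024, 1024], [1024, 1]))+tf32=False": [
        {"template": "triton", "BLOCK_M": 64, "BLOCK_N": 128, "BLOCK_K": 32,
         "num_stages": 3, "num_warps": 4},
    ],
}

def lookup_configs(key: str, enabled: bool = True) -> Optional[list[dict]]:
    """Return pinned configs for `key`, or None to fall back to the
    default config heuristics (i.e. a table miss)."""
    if not enabled:  # mirrors the on/off knob in inductor_config.triton
        return None
    return LOOKUP_TABLE.get(key)
```

A miss (or the knob being off) returning `None` is what lets the caller fall through to the existing heuristics unchanged.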

Test Plan:

```
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:template_heuristics_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_e2e
```

Rollback Plan:

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

Differential Revision: D77895791


pytorch-bot bot commented Jul 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157699

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0895c76 with merge base ecea811:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@coconutruben (Contributor Author)

Apologies @jansel @masnesral @PaulZhang12, I broke this out from #156785 to be GitHub-first and to speed up linting and development. This version addresses some of the feedback from last week, namely adding a way to pass through EVEN_K and ALLOW_TF32.

Please let me know what else we should address here.

@coconutruben (Contributor Author)

@coconutruben has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 7, 2025

```
assert len(input_nodes) == 3 and input_nodes[0] == mat
size = V.graph.sizevars.size_hints(
    size,
    fallback=torch._inductor.config.unbacked_symint_fallback,
```
Contributor
Are we hitting unbacked symints in practice? I worry the tuned configs in these cases could be false positives since we don't have realistic shapes. Maybe we should not use the lookup table in this case?

Contributor Author

I haven't seen it in practice yet, but I also haven't looked too thoroughly. If there is an unbacked symint in practice, what happens in the current autotuning logic, i.e. how does the sample input get generated for benchmarking it? We can skip the table, but we could also just match whatever that does, so we won't be worse off.

Contributor

I'm not sure, let's double check.

Contributor Author

https://github.com/pytorch/pytorch/blob/main/torch/_inductor/autotune_process.py#L341
TensorMeta uses the fallback as well to generate the sizes and strides, and we use those to generate tensors for benchmarking, so it seems it's the same as what we're doing here.
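For intuition, the fallback behavior discussed above amounts to substituting a fixed concrete value for any size that has no backing hint. This is a simplified sketch (the real `V.graph.sizevars.size_hints` works on symbolic expressions; here unbacked sizes are modeled as strings, and the fallback value is an assumption):

```python
# Illustrative stand-in for size-hint resolution with an unbacked-symint
# fallback. Real inductor code resolves sympy expressions; we model an
# unbacked size as a non-int placeholder like "u0".
UNBACKED_SYMINT_FALLBACK = 8192  # assumed default for illustration

def size_hints(sizes, fallback=UNBACKED_SYMINT_FALLBACK):
    """Map each size to a concrete int: backed ints pass through,
    anything symbolic gets the fallback value."""
    return tuple(s if isinstance(s, int) else fallback for s in sizes)
```

Both the benchmarking-input generation and the table lookup would then see the same concrete shape for an unbacked dimension.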

@etaf etaf added the ciflow/xpu Run XPU CI tasks label Jul 9, 2025
@etaf (Collaborator) commented Jul 9, 2025

Added ciflow/xpu to check for potential XPU breakages. Apologies for any inconvenience caused.

coconutruben added a commit that referenced this pull request Jul 9, 2025
ghstack-source-id: cbaa42c
Pull Request resolved: #157699
@coconutruben (Contributor Author)

ALLOW_TF32 should be part of the key since it changes the semantics of the program.

What difference do you see between having ALLOW_TF32 as part of the key versus having it as part of the retrieved value, where we filter out all the values that don't match?
The reason it's in the value is that we need it for template instantiation. If we add it to the key, are we fine keeping it in both the key and the value?


@coconutruben (Contributor Author)

ALLOW_TF32 should be part of the key since it changes the semantics of the program.

asked differently, @jansel do you just want to have the value of torch.backends.cuda.matmul.allow_tf32 inside the key?

@jansel (Contributor) commented Jul 23, 2025

asked differently, @jansel do you just want to have the value of torch.backends.cuda.matmul.allow_tf32 inside the key?

Yes exactly.


@coconutruben (Contributor Author) commented Jul 23, 2025

torch.backends.cuda.matmul.allow_tf32

It's now part of the key, and we have a general notion of an "input_key_suffix" where we can stuff inductor-wide things that we want to hold.

This is the new-style input key:

"NVIDIA H100+mm+((torch.bfloat16, [1024, 1024], [1024, 1]), (torch.bfloat16, [1024, 1024], [1024, 1]))+tf32=False"
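The key format above can be read as device name, op, per-input (dtype, sizes, strides) signature, and a suffix of inductor-wide flags. A hedged sketch of building such a key — `make_lookup_key` is a hypothetical helper whose format is inferred from the example string, not the real inductor function:

```python
# Hypothetical key builder matching the example key quoted above.
# inputs: iterable of (dtype_str, sizes, strides) per input tensor.
def make_lookup_key(device_name, op, inputs, tf32):
    # Each input becomes "(dtype, [sizes], [strides])"; the whole
    # signature is wrapped in one more pair of parens.
    sig = ", ".join(
        f"({dtype}, {list(sizes)}, {list(strides)})"
        for dtype, sizes, strides in inputs
    )
    # The tf32 flag rides along as an inductor-wide key suffix.
    return f"{device_name}+{op}+({sig})+tf32={tf32}"
```

For the 1024x1024 bf16 mm example, this reproduces the quoted key exactly.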


@coconutruben coconutruben requested a review from eellison July 30, 2025 04:14
@BoyuanFeng BoyuanFeng self-requested a review August 9, 2025 03:36

```
@@ -319,13 +320,13 @@

{%- if TMA_EXPERIMENTAL_API %}
a = tl._experimental_descriptor_load(
    a_desc,
```
Contributor Author
@PaulZhang12 can you double-check whether this is right and makes sense? After rebasing I kept failing on tma with TMA_EXPERIMENTAL_API, and it seems to me the code here is just wrong (a_desc is not defined when running without TMA_EXPERIMENTAL_API).

@eellison (Contributor) left a comment

Instead of special-casing the config lookup table, can we make the lookup table part of a generic API to override config selection?

We should be able to use one code path for both the config lookup table and the prediction model.

@coconutruben (Contributor Author)

Instead of special-casing the config lookup table, can we make the lookup table part of a generic API to override config selection?

We should be able to use one code path for both the config lookup table and the prediction model.

@eellison I can make the lookup table a more generic "config overrider" interface/class, but I'm wondering, what do you and @jansel / @exclamaforte think about treating the performance model as a (generic) config filter in this case? i.e. the interplay of standard config generation, lookup table (LUT) and performance model being

  1. LUT or config heuristic generate the configs
  2. performance models filters them down to topk

I think this would be better, because the performance model is at its most useful when it just rates (runs a prediction on) each config and then picks the top-k, rather than the contract being "generate me the best configs".

I plan to evolve get_mm_configs in choices into a more generic get_template_configs that takes in the full list of templates in use for one op (e.g. triton mm, triton tma, cutlass mm), iterates over those, and generates all the configs/choices. This then allows us to do one op-wide override from the lookup table (just get all the template/config pairs for that op/input from the table) and have the performance model run eval on all the configs (or the ones it can). It also allows op-wide decisions in a single place (e.g. if there are 0 matches in the table, do we fall back to ATen?).

What do you think of that approach: a config generator interface (config heuristics, lookup table) and a config filter interface (performance model)?
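To make the proposed split concrete, here is a minimal sketch of the generator/filter shape being discussed. All class and function names are illustrative, not the actual choices.py API:

```python
# Sketch of the proposed split: config generators (heuristics or the lookup
# table) produce candidates, and a config filter (e.g. a performance model)
# scores them and keeps the top-k.
from typing import Protocol

class ConfigGenerator(Protocol):
    def generate(self, op: str, input_key: str) -> list[dict]: ...

class ConfigFilter(Protocol):
    def filter(self, op: str, configs: list[dict], k: int) -> list[dict]: ...

class PerfModelFilter:
    """Rates each config with a predicted latency and keeps the top-k."""
    def __init__(self, predict):
        self.predict = predict  # config -> predicted latency (lower is better)

    def filter(self, op, configs, k):
        return sorted(configs, key=self.predict)[:k]

def get_template_configs(op, input_key, generators, config_filter, k=3):
    """One code path: gather configs from all generators (lookup table and/or
    heuristics), then let the filter pick the top-k."""
    configs = []
    for gen in generators:
        configs.extend(gen.generate(op, input_key))
    return config_filter.filter(op, configs, k)
```

Under this shape, the lookup table and the standard heuristics are interchangeable generators, and the performance model never has to own config generation.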

@jansel (Contributor) commented Aug 12, 2025

I want both of these to go through a common extension point in choices.py.

Labels: ciflow/inductor, ciflow/trunk, ciflow/xpu, module: inductor, topic: not user facing