[inductor] initial triton static config lookup table #157699
Summary:

# Why
- enable initial feature set to see wider internal benchmarking and adoption
- introduce expected behavior testing that subsequent expansions (more backends, functions, etc) can rely on

# What
- First version of a static lookup table for Triton configs across mm, addmm, bmm, mm_plus_mm
- supports triton, tma, decompose_k, and bias_addmm in configs
- configuration inside lookup_table.py for now, with knob to turn on/off in inductor_config.triton

Test Plan:
```
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:template_heuristics_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_e2e
```

Rollback Plan:
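For readers unfamiliar with the idea, a static lookup table can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code in this PR: `LOOKUP_TABLE`, `make_key`, `lookup_configs`, the key format, and the config field names are all hypothetical. It only illustrates mapping an op plus input shapes to pre-chosen configs, with an on/off knob and a miss that falls back to normal autotuning.

```python
from typing import Optional

# Hypothetical table: op name -> input key -> list of config dicts.
# The key format and field names are illustrative, not the PR's actual schema.
LOOKUP_TABLE: dict[str, dict[str, list[dict]]] = {
    "mm": {
        "float16:(128, 64)x(64, 256)": [
            {
                "template": "triton",
                "BLOCK_M": 64,
                "BLOCK_N": 64,
                "BLOCK_K": 32,
                "num_stages": 2,
                "num_warps": 4,
            },
        ],
    },
}


def make_key(dtype: str, a_shape: tuple[int, ...], b_shape: tuple[int, ...]) -> str:
    """Build a string key from the dtype and the two operand shapes."""
    return f"{dtype}:{a_shape}x{b_shape}"


def lookup_configs(
    op: str,
    dtype: str,
    a_shape: tuple[int, ...],
    b_shape: tuple[int, ...],
    enabled: bool = True,
) -> Optional[list[dict]]:
    """Return stored configs, or None to signal 'use the normal autotuning path'."""
    if not enabled:  # stands in for the on/off knob mentioned in the summary
        return None
    return LOOKUP_TABLE.get(op, {}).get(make_key(dtype, a_shape, b_shape))
```

A caller would treat `None` as "no entry" and continue with the usual config generation, so enabling the table cannot remove choices for shapes it does not cover.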
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157699
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 0895c76 with merge base ecea811.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Apologies @jansel @masnesral @PaulZhang12, I broke this out from #156785 to be GitHub-first and to speed up linting, development, etc. This version addresses some of the feedback from last week, namely having a way of passing through EVEN_K and ALLOW_TF32. Please let me know what else we should address here.
@coconutruben has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
torch/_inductor/kernel/mm_common.py (outdated)
assert len(input_nodes) == 3 and input_nodes[0] == mat
size = V.graph.sizevars.size_hints(
    size,
    fallback=torch._inductor.config.unbacked_symint_fallback,
Are we hitting unbacked symints in practice? I worry the tuned configs in these cases could be false positives, since we don't have realistic shapes. Maybe we should not use the lookup table in this case?
I haven't seen it in practice yet, but I also haven't looked too thoroughly. If there is an unbacked symint in practice, what happens in the current autotuning logic, i.e. how does the sample input get generated for benchmarking it? We can skip the table, but we can also just match whatever that does, since then we won't be any worse.
I'm not sure, let's double check.
https://github.com/pytorch/pytorch/blob/main/torch/_inductor/autotune_process.py#L341
TensorMeta uses the fallback as well to generate the sizes and strides, and we use those to generate tensors for benchmarking, so it seems to be the same as what we're doing here.
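The point of this thread can be sketched in a few lines. This is an illustrative model only, assuming a dimension without a usable hint is represented as `None`; `size_hints`, `lookup_key`, and `benchmark_shape` below are hypothetical stand-ins, not the Inductor functions, and the fallback value is arbitrary. It shows that when both the lookup key and the benchmark tensors resolve unknown dimensions through the same fallback, they agree on the concrete shape.

```python
from typing import Optional, Sequence


def size_hints(sizes: Sequence[Optional[int]], fallback: int) -> tuple[int, ...]:
    """Replace dimensions that have no known hint (None here) with `fallback`."""
    return tuple(fallback if s is None else s for s in sizes)


def lookup_key(sizes: Sequence[Optional[int]], fallback: int) -> str:
    """Hypothetical lookup-table key derived from the resolved sizes."""
    return "x".join(str(s) for s in size_hints(sizes, fallback))


def benchmark_shape(sizes: Sequence[Optional[int]], fallback: int) -> tuple[int, ...]:
    """Hypothetical shape used to allocate sample tensors for benchmarking."""
    return size_hints(sizes, fallback)


# One unknown dimension, resolved with an arbitrary fallback value.
sizes = [None, 64]
FALLBACK = 4096  # illustrative value, not the real config default
key = lookup_key(sizes, FALLBACK)
shape = benchmark_shape(sizes, FALLBACK)
```

Under this model the table is consulted with the same shape that autotuning would have benchmarked, which is the sense in which using the table "won't be worse" than the existing path.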
Added ciflow/xpu to check for potential XPU breakages. Apologies for any inconvenience caused.
What difference do you see between having |
Asked differently: @jansel, do you just want the value of torch.backends.cuda.matmul.allow_tf32 inside the key?
Yes, exactly.
It's now part of the key. We also have a general notion of an "input_key_suffix", where we can put Inductor-wide settings that we want the key to hold. This is a new-style input key.
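To make the idea concrete, here is a minimal sketch of a key that carries a global setting as a suffix. The key layout, the `+` separator, and the helper names `input_key_suffix` and `make_input_key` are assumptions for illustration; the actual key format used by the PR is not shown in this thread. The TF32 flag is passed in as a plain bool rather than read from `torch.backends.cuda.matmul.allow_tf32`, so the snippet runs without PyTorch.

```python
def input_key_suffix(allow_tf32: bool) -> str:
    """Encode global settings that should distinguish table entries (hypothetical format)."""
    return f"tf32_{allow_tf32}"


def make_input_key(base_key: str, allow_tf32: bool) -> str:
    """Append the suffix so the same shapes map to different entries per setting."""
    return f"{base_key}+{input_key_suffix(allow_tf32)}"


# The same operand description yields two distinct keys depending on the flag.
base = "mm:float32:(64, 64)x(64, 64)"
key_tf32_on = make_input_key(base, allow_tf32=True)
key_tf32_off = make_input_key(base, allow_tf32=False)
```

Putting the setting in the key rather than in the config means a config tuned with TF32 enabled is never silently reused when TF32 is disabled; the lookup simply misses.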
@@ -319,13 +320,13 @@

{%- if TMA_EXPERIMENTAL_API %}
a = tl._experimental_descriptor_load(
    a_desc,
@PaulZhang12 can you double-check that this is right and makes sense? After rebasing, I kept failing on tma with TMA_EXPERIMENTAL_API, and it seems to me the code here is just wrong (a_desc is not defined when running without TMA_EXPERIMENTAL_API).
Instead of special-casing the config lookup table, can we make the lookup table part of a generic API to override the config selection?
We should be able to use one code path for both the config lookup table and the prediction model.
@eellison I can make the lookup table a more generic "config overrider" interface/class, but I'm wondering, what do you and @jansel / @exclamaforte think about treating the performance model as a (generic) config filter in this case? i.e. the interplay of standard config generation, lookup table (LUT) and performance model being
I think this would be better, because the performance model is at its most useful when it just rates (runs a prediction on) each config and then picks the top-k from that, rather than the contract being "generate me the best configs". I plan to evolve the
What do you guys think of that approach, of having a config generator interface (config heuristics, lookup table) and a config filter interface (performance model)?
I want both of these to go through a common extension point in choices.py.
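The generator/filter split discussed above, routed through a single entry point, could look roughly like this. It is a sketch under assumptions, not the design that landed: every class and function name here (`ConfigGenerator`, `ConfigFilter`, `HeuristicGenerator`, `LookupTableGenerator`, `TopKFilter`, `choose_configs`) is hypothetical, and nothing below reflects the real contents of choices.py.

```python
from typing import Callable, Protocol

Config = dict[str, int]


class ConfigGenerator(Protocol):
    def generate(self, op: str, key: str) -> list[Config]: ...


class ConfigFilter(Protocol):
    def apply(self, configs: list[Config]) -> list[Config]: ...


class HeuristicGenerator:
    """Stand-in for the standard config heuristics: a small fixed grid."""

    def generate(self, op: str, key: str) -> list[Config]:
        return [{"BLOCK_M": m, "BLOCK_N": n} for m in (32, 64) for n in (32, 64)]


class LookupTableGenerator:
    """Return table entries on a hit, otherwise defer to a fallback generator."""

    def __init__(self, table: dict[tuple[str, str], list[Config]], fallback: ConfigGenerator):
        self.table = table
        self.fallback = fallback

    def generate(self, op: str, key: str) -> list[Config]:
        hit = self.table.get((op, key))
        return hit if hit is not None else self.fallback.generate(op, key)


class TopKFilter:
    """Stand-in for a performance model: rate each config, keep the k best."""

    def __init__(self, predict_runtime: Callable[[Config], float], k: int):
        self.predict_runtime = predict_runtime
        self.k = k

    def apply(self, configs: list[Config]) -> list[Config]:
        # Lower predicted runtime is better; sorted() is stable for ties.
        return sorted(configs, key=self.predict_runtime)[: self.k]


def choose_configs(
    generator: ConfigGenerator, filters: list[ConfigFilter], op: str, key: str
) -> list[Config]:
    """Single extension point: generate candidates, then apply each filter in order."""
    configs = generator.generate(op, key)
    for f in filters:
        configs = f.apply(configs)
    return configs
```

In this shape the lookup table and the heuristics are interchangeable sources of candidates, while the performance model only ranks what it is given, which matches the "rate each config, pick the top-k" contract described in the comment.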
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov
Differential Revision: D77895791