[inductor] initial triton static config lookup table #156785
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156785
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV — there is 1 currently active SEV. If your PR is affected, please view it below.
❌ 2 New Failures, 2 Unrelated Failures — as of commit 32deb56 with merge base 6215e90:
NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D76945358
Failing tests and lints
torch/_inductor/kernel/bmm.py
if inductor_config.triton.use_gemm_config_lookup_table:
    if len(choices) > 1:
        choices = [choices[-1]]
Why is this needed? Does this mean the lookup table returned multiple values? How does this happen?
This is a way to get "fall back to aten if there is no hit in the table" behavior.
You're right that the lookup table returns at most a single config, but we always also add the aten option. So if we're using the lookup table and there are multiple choices, that means there was a hit, and we skip aten.
Let's refactor this to be clearer.
Are you good with the way it's abstracted out into lookup_table.py now?
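As a rough illustration of the kind of abstraction being discussed (the helper name and signature here are hypothetical, not the PR's actual API), the "prefer the lookup-table hit over the aten fallback" logic could be pulled out roughly like this:

```python
# A minimal sketch of the abstraction discussed above, assuming a hypothetical
# helper name; the real code in lookup_table.py may look different.
from typing import Any


def prefer_lookup_table_hit(choices: list[Any], use_lookup_table: bool) -> list[Any]:
    """Keep only the lookup-table choice when there was a hit.

    The caller always appends the aten fallback, and the table contributes at
    most one config, so more than one choice implies a table hit.
    """
    if use_lookup_table and len(choices) > 1:
        return [choices[-1]]
    return choices
```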
for config in mm_configs(
    m,
    n,
    k,
    **mm_config_kwargs(device_type, _is_large_block_for_cpu, dtype.itemsize),
    *mm_config_kwargs(device_type, _is_large_block_for_cpu, dtype.itemsize),
    valid=valid_config,
Using config filtering here may be error-prone. If we change the config space, then you could get zero configs. Can we look up the correct config from the table?
This is just poorly worded; I can rephrase. The valid_config is used to parse the config out of the table entry (we don't filter for it). If we're using the lookup table and valid is supplied, there will be either no mm_configs (valid is empty, there was a parsing error, etc.) or a single config: the one defined in valid.
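As a rough illustration of "parse, don't filter" (the entry fields and helper below are assumptions for the sketch, not the PR's actual schema):

```python
# Hypothetical sketch: turn a lookup-table entry ("valid") into zero or one
# configs. Field names are illustrative, not the PR's real format.
from dataclasses import dataclass
from typing import Optional


@dataclass
class GemmConfig:
    block_m: int
    block_n: int
    block_k: int
    num_stages: int
    num_warps: int


def configs_from_valid(valid: Optional[dict]) -> list[GemmConfig]:
    """Return [] on a miss or parse failure, otherwise exactly one config."""
    if not valid:
        return []
    try:
        return [
            GemmConfig(
                block_m=int(valid["BLOCK_M"]),
                block_n=int(valid["BLOCK_N"]),
                block_k=int(valid["BLOCK_K"]),
                num_stages=int(valid["num_stages"]),
                num_warps=int(valid["num_warps"]),
            )
        ]
    except (KeyError, TypeError, ValueError):
        return []
```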
@@ -696,14 +699,18 @@ def tuned_mm(mat1, mat2, *, layout=None):
    mm_configs = V.choices.get_base_mm_configs(device_type)
    persistent_mm_configs = V.choices.get_persistent_mm_configs(device_type)
    extra_mm_configs = V.choices.get_extra_mm_configs(device_type)
    lookup_dict = get_lookup_table([mat1, mat2], name)
Do we need to take into account the epilogue in the lookup table?
IIUC, no. If you mean the template epilogue (e.g. for addmm): the table is function-specific right now, so a config for addmm is picked specifically for addmm and is found in the lookup table under the addmm key.
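To make "function-specific" concrete, the table is keyed per op, along these lines (the key and value layout below are illustrative assumptions, not the exact on-disk format; the triton/tma template names come from the PR summary):

```python
# Illustrative layout only -- the real key/value schema may differ.
example_lookup_table = {
    "addmm": {
        # problem key (e.g. derived from input shapes/dtypes) -> chosen config
        "m=1024,n=1024,k=512,dtype=torch.float16": {
            "template": "triton",
            "BLOCK_M": 64,
            "BLOCK_N": 128,
            "BLOCK_K": 32,
            "num_stages": 4,
            "num_warps": 8,
        },
    },
    "mm": {
        "m=2048,n=2048,k=2048,dtype=torch.bfloat16": {
            "template": "tma",
            "BLOCK_M": 128,
            "BLOCK_N": 128,
            "BLOCK_K": 64,
            "num_stages": 3,
            "num_warps": 8,
        },
    },
}
```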
But doesn't the autotuning measure using the epilogue? I recall @eellison added that. This can affect the performance, sometimes a lot (the epilogue can cause register spills).
IIUC what @eellison explained to me, this should be safe because it happens in two stages. We first find the best config (here), and then we benchmark that best config + epilogue against not fusing the epilogue. So at this stage we're not interfering with that decision; we don't do an NxN comparison of configs + epilogue vs. configs.
@jansel It depends on how we construct the lookup table. If we take the autotune logs, for example, then it is not fusion-aware. However, if we look at whether the scheduler does a fusion with a certain config in a previous max-autotune run, we can use that config in the lookup table.
The lookup table just uses whichever config(s) we give it. There is no benchmarking of epilogues by default as that can lead to significant compile time overhead.
torch/_inductor/lookup_table.py
""" | ||
Load and return the gemm config lookup table from file if configured. | ||
""" | ||
global gemm_config_lookup_table |
functools.cache()?
We want to support both providing a path and also just defining the table raw in lookup_table.py by writing a dict to it. With @cache, I would lose that ability, I think?
I think there is a .cache_clear()
Implemented an lru_cache now for the file reading itself (path dependent), and gemm_config_lookup_table can still be set to override the file reading. This covers the current use cases, prints a warning when both are set, and should work for now, but let me know if you still see an issue there.
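A minimal sketch of the caching pattern described here, assuming illustrative names (only gemm_config_lookup_table and the file-based loading appear in the thread; the helper names and path argument below are assumptions):

```python
# Sketch of the lru_cache + module-level override pattern discussed above.
# Names other than gemm_config_lookup_table are illustrative assumptions.
import functools
import json
import logging
from typing import Optional

log = logging.getLogger(__name__)

# Module-level override: a dict can be assigned here directly instead of
# configuring a file path.
gemm_config_lookup_table: Optional[dict] = None


@functools.lru_cache(maxsize=None)
def _load_table_from_file(path: str) -> dict:
    """File reads are cached per path; _load_table_from_file.cache_clear() resets them."""
    with open(path) as f:
        return json.load(f)


def get_gemm_config_lookup_table(path: Optional[str]) -> dict:
    if gemm_config_lookup_table is not None:
        if path:
            log.warning(
                "Both gemm_config_lookup_table and a lookup table path are set; "
                "using the in-memory table and ignoring the file."
            )
        return gemm_config_lookup_table
    if path:
        return _load_table_from_file(path)
    return {}
```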
Summary:
Pull Request resolved: pytorch#156785

# Why
- enable initial feature set to see wider internal benchmarking and adoption
- introduce expected behavior testing that subsequent expansions (more backends, functions, etc) can rely on

# What
- First version of a static lookup table for Triton configs across mm, addmm, bmm, mm_plus_mm
- supports triton, tma, decompose_k, and bias_addmm in configs
- configuration inside lookup_table.py for now, with knob to turn on/off in inductor_config.triton

Test Plan:
```
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:template_heuristics_cpu
buck2 test mode/opt fbcode//caffe2/test/inductor:lookup_table_e2e
```

Rollback Plan:

Differential Revision: D76945358
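For reference, a minimal way to exercise the knob (the flag name inductor_config.triton.use_gemm_config_lookup_table comes from the diff above; this usage sketch assumes the PR is applied and a CUDA device is available):

```python
# Usage sketch, assuming this PR is applied: enable the lookup-table knob and
# compile a matmul so the tuned mm path can consult the table.
import torch
import torch._inductor.config as inductor_config

inductor_config.triton.use_gemm_config_lookup_table = True


@torch.compile
def matmul(a, b):
    return a @ b


if torch.cuda.is_available():
    a = torch.randn(1024, 512, device="cuda", dtype=torch.float16)
    b = torch.randn(512, 2048, device="cuda", dtype=torch.float16)
    matmul(a, b)
```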
I added ciflow/xpu to test whether this feature would break XPU. I apologize for any inconvenience this may have caused.
Are we including tf32 enablement and device info in the lookup table key?
We're not, for this version of it. In this version, that is meant to be handled by keeping config.py consistent between the runtime and the lookup table generation. The next version will definitely have this.
Won't this affect the file format? How will we maintain BC with the current format when we make this change?
@jansel what is the issue with tf32 enablement here?
The performance with and without tf32 is totally different (it gets mapped to different hardware units). So if you autotune with tf32 enabled (a global flag), that config should not be used when it is disabled.
It will affect the format; the v1 format is a subset of the v3 (full) format. There are 2 pathways
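For illustration, a later format version could fold tf32 state and device info into the key roughly like this (the current version deliberately does not; the key format here is an assumption):

```python
# Illustrative only: fold tf32 state and device info into the lookup key.
# The current PR does not do this; the key format here is an assumption.
import torch


def lookup_key(op_name: str, m: int, n: int, k: int, dtype: torch.dtype) -> str:
    tf32 = torch.backends.cuda.matmul.allow_tf32
    device = torch.cuda.get_device_name() if torch.cuda.is_available() else "cpu"
    return f"{op_name}:m={m},n={n},k={k},dtype={dtype},tf32={tf32},device={device}"
```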
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov