[dynamo] [guard] Add caching for inside torch.compile.disable function to avoid unnecessary recompilation. #157566
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157566
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New Failure as of commit 2b2bb11 with merge base f636736.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from a86c9d4 to 9a9e576.
@pytorchbot label "release notes: dynamo" |
Force-pushed from 9a9e576 to 23ea862.
@anijain2305 Could you please take a look at my code? I would really appreciate your feedback.
@xmfan Could you please take a look at my code? I would really appreciate your feedback.
torch/_dynamo/variables/functions.py (outdated)
@@ -1387,7 +1387,9 @@ def as_python_constant(self):

    @classmethod
    def create_with_source(cls, value, source):
        if not is_wrapper_or_member_descriptor(value):
            if inspect.getattr_static(value, "_torchdynamo_disable", False):
                install_guard(source.make_guard(GuardBuilder.TYPE_MATCH))
Maybe install the FUNCTION_MATCH on the value._torchdynamo_disable. Something along the lines of:
fn_obj = value._torchdynamo_disable
install_guard(AttrSource(source, "_torchdynamo_disable").make_guard(GuardBuilder.FUNCTION_MATCH))
I'm new to dynamo, so forgive the silly question, but could we add a new rule to trace_rules.manual_torch_name_rule_map instead?
"torch._dynamo.disable": SkipFunctionVariable,
It wouldn't introduce additional code, but I haven't tested it beyond the happy path. @anijain2305 @thenumberouscode
Re the FUNCTION_MATCH-on-value._torchdynamo_disable suggestion above: your code indeed offers a more granular solution. I'm just waiting for the results of all the workflows.
Regarding manual_torch_name_rule_map: have you tested it in your local environment? I'm afraid it may not be effective in solving this problem.
I just gave it a quick test to see if the bug occurs and it doesn't, but I didn't test it further.
@anijain2305 I don't think this is sound - _torchdynamo_disable is only ever True, right? The root of the issue is that the ID of a local nested function changes on each call, and I don't think we can guarantee it's semantically the same function (which is why we usually check function ID). We could have any number of closures inside the function and pass them as arguments to the resume function based on different conditions, and we wouldn't recompile the resume function even though we should.
@mlazos I’m a bit confused about something. To me, a function that doesn’t need to be compiled means we don’t have to worry about its implementation, arguments, or any other details; we just pass it to Python’s default interpreter. cc @anijain2305
@mlazos I think I understand what you're pointing out. You mean that my bug fix would cause all user functions decorated with _torchdynamo_disable to skip the ID check, whether they are nested functions or not, which is too wide open.
Yeah, your second comment is similar to what I'm saying. The reason we check functions by ID is that you can't check semantic equivalence of functions, and even with nested functions it still won't work. The main issue: say you have different nested functions and you conditionally pass them to the resume function. It looks like this:
@torch.compile()
def outer(x, cond):
    @torch._dynamo.disable()
    def fn0(y):
        return y + 1

    @torch._dynamo.disable()
    def fn1(y):
        return y + 2

    if cond:
        f = fn0
    else:
        f = fn1

    torch._dynamo.graph_break()
    # there will be a resume function here
    return f(x)
In this case, if you flip cond from True to False, the outer function will recompile, and the resume function after the graph break also needs to recompile because its behavior will change. So you need to guard on the ID of f even though it will change on every call to outer. If the resume function doesn't properly recompile when f changes, this code will be incorrect.
Cool, your understanding of the guard is very deep—I’ve learned a lot from you. Now, let's fix it according to your suggestions.
@pytorchbot label "ciflow/trunk" |
@anijain2305 Hi, A workflow test has failed (https://github.com/pytorch/pytorch/actions/runs/16063185244/job/45383170740?pr=157566) due to a timeout, as indicated in the log. I suspect this issue is unrelated to my bug fix. Could we please retry the test? Thank you! |
Force-pushed from 23ea862 to 773e750.
Force-pushed from 773e750 to 1a14970.
@mlazos @anijain2305 I have added a cache for dynamically created user functions decorated with torch._dynamo.disable, allowing us to use the real user function IDs with the ID_MATCH guard without needing recompilation. Please check out my new code and share your feedback when you get a chance.
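Roughly, the idea is to look the disabled wrapper up in a cache keyed by the inner function's code object instead of creating a fresh object on every call. A minimal sketch of that lookup (hypothetical helper names, not the exact code in this PR):

import torch

# Hypothetical sketch: reuse one disabled wrapper per inner-function code object, so every
# call of the enclosing compiled function passes the same object to the resume function and
# the ID_MATCH guard on it stays stable.
_nested_fn_cache = {}  # fn.__code__ -> disabled wrapper

def disable_with_cache(fn):
    wrapper = _nested_fn_cache.get(fn.__code__)
    if wrapper is None:
        wrapper = torch._dynamo.disable(fn)
        _nested_fn_cache[fn.__code__] = wrapper
    return wrapper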
Force-pushed from 24410c2 to 7f8143d.
Can you explain a little more how it works? Looking at the code, it isn't clear to me why it works, haha.
Sure, the call of resume functions always has the following format: see pytorch/torch/_dynamo/eval_frame.py, line 1002 in 6f23f53.
Ah, this makes sense now, I didn't realize how [...]. We should also clear it if [...]
Force-pushed from 7f8143d to a4ed3f5.
@mlazos After conducting a local test, I've discovered that the finalizer in OutputGraph is not functioning in this scenario. Since our cache is shared across different OutputGraph instances, we cannot clear it during the exit process of a single graph. Our cache needs to persist throughout the entire compilation process, similar to how code caches operate. To address this issue, we should implement a CreateNestedFnCache to manage all function caches, which can then be cleared using torch._dynamo.reset(). Please take a look at my new commit to see the implementations, and don't hesitate to share any feedback or suggestions!
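For reference, a rough sketch of what such a cache could look like (simplified; names and details may differ from the actual commit), keyed by the inner function's code object and cleared together with the code caches via torch._dynamo.reset():

class CreateNestedFnCache:
    # code object of the user's inner function -> previously created disabled function
    cache = {}

    @classmethod
    def get(cls, key):
        return cls.cache.get(key)

    @classmethod
    def set(cls, key, value):
        cls.cache[key] = value

    @classmethod
    def clear(cls):
        # hooked into torch._dynamo.reset() so the cache does not outlive a full reset
        cls.cache.clear()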
@mlazos The workflow failed again for an unrelated lint problem. Could we just run the workflow again? Thanks.
…o.disable to obtain their actual IDs, thereby avoiding ID_MATCH guard failures.
Force-pushed from 93eb5d7 to 2b2bb11.
@mlazos @anijain2305 Can we trigger another workflow run? Appreciate it.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "failed an odd internal test, please reach out to metamate to fix it, D79112610" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here. |
… function to avoid unnecessary recompilation. (#157566)" This reverts commit 8e07c98. Reverted #157566 on behalf of https://github.com/yangw-dev due to: failed an odd internal test, please reach out to metamate to fix it, D79112610.
@thenumberouscode your PR has been successfully reverted.
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
…n to avoid unnecessary recompilation. (#157566) Calling a function decorated with torch._dynamo.disable inside a torch.compile'd function always triggers recompilation, because the user's inner function is used as an argument to the resume_in_xx function; in the current implementation it is always a new object, so the ID_MATCH guard always fails and triggers recompilation. Fixes #157399 @xmfan Pull Request resolved: #157566 Approved by: https://github.com/mlazos, https://github.com/anijain2305
@yangw-dev Which unit test failed? I'll fix it.
Hi @yangw-dev, can you point me to the test failure? I only see one build failure on that diff.
@yangw-dev Can you let me know which unit test failed? cc @mlazos
@mlazos, since we don't have @yangw-dev's help here, what should we do next? I've put a lot of time into this PR and really don't want it to go to waste.
No worries, we'll figure this out. I was at an offsite last week, so I didn't have a chance to follow up. I'll take a look at this this week and try to debug the diff.
Thanks a lot for your help!
Calling a function decorated with torch._dynamo.disable inside a torch.compile'd function always triggers recompilation: the user's inner function is used as an argument in the resume_in_xx function, and in the current implementation it is always a new object, so the ID_MATCH guard always fails and forces recompilation.
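The failing pattern looks roughly like this (a simplified repro sketch, not taken verbatim from the issue):

import torch

@torch.compile
def outer(x):
    # a new function object (and therefore a new id) is created on every call to outer
    @torch._dynamo.disable
    def helper(y):
        return y + 1

    torch._dynamo.graph_break()
    # helper is passed into the resume function here, where it is guarded by ID_MATCH
    return helper(x)

for _ in range(3):
    # without caching, every call recompiles the resume function because helper's id changed
    outer(torch.randn(4))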
Fixes #157399
@xmfan
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @Lucaskabela @mlazos