Remove accidental host synchronization in autograd cpu offload #159698
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159698
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 3 Unrelated Failures as of commit 67f5adf with merge base e4b123b. One new job failure was introduced; the remaining failures were flagged as flaky or unstable jobs already present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
)
-    packed.copy_(tensor)
+    packed.copy_(tensor, non_blocking=actually_pin_memory)
@albanD, @soulitzer what stream would this copy run on? Is it guaranteed that offloading and onloading will run on the same stream? Otherwise we do need a sync between offloading and onloading.
We can empirically check with CUPTI.
I ran the test case I wrote with cudaProfilerStart() and cudaProfilerStop() added just before and after the CudaSyncGuard context manager to remove noise.
If you go to "Stats System View", then "CUDA GPU Trace", and sort by the "Strm" column, you will see that all GPU operations run on stream 7.
However, your comment made me realize that this CPU memory offloading hook is honestly pretty poor: it uses the same stream as the rest of the computation. An ideal implementation would use a separate stream for these copies, so that you don't pay a time cost in order to use less memory.
If I were to add the ability to use a separate stream to this existing API, though, it could break users who rely on device memory being immediately recycled (if anyone is even using this at all).
Both Megatron-LM and, I think, torchao have their own memory offloading context managers that handle this. I'm happy to give up on this PR, or to merge it without using a separate stream for copying.
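For context, a separate-stream offload could look roughly like the sketch below. This is only an illustration of the idea, not part of this PR: the function and variable names (offload_to_cpu, onload_from_cpu, copy_stream) are made up, and a real implementation would also need to decide when the original device tensor may be freed.

```python
import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for offload/onload copies

def offload_to_cpu(tensor: torch.Tensor):
    # Stage into pinned host memory so the D2H copy can be truly asynchronous.
    host = torch.empty(tensor.shape, dtype=tensor.dtype, device="cpu", pin_memory=True)
    copy_stream.wait_stream(torch.cuda.current_stream())  # copy only after the producer finishes
    with torch.cuda.stream(copy_stream):
        host.copy_(tensor, non_blocking=True)
        # Keep the caching allocator from recycling `tensor`'s memory too early.
        tensor.record_stream(copy_stream)
    done = torch.cuda.Event()
    done.record(copy_stream)
    return host, done, tensor.device

def onload_from_cpu(host: torch.Tensor, done: torch.cuda.Event, device: torch.device):
    # Order the H2D copy after the offload copy, without blocking the host.
    torch.cuda.current_stream().wait_event(done)
    return host.to(device, non_blocking=True)
```

This is exactly the trade-off mentioned above: with a separate stream the device memory is not immediately reusable, which is what could break existing users of the hook.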
Yeah, I suspected this hook doesn't do anything smart (which is why torchao and Megatron need something better). But for workloads that use different streams for different autograd nodes, the copy might run on the stream the node ran on, or on some other stream; I didn't dig deep enough into the implementation to know what happens.
Yes, this hook is more of a simple showcase than anything fancy.
"Is it guaranteed that offloading and onloading will run on the same stream?"
Yes it is: the autograd engine sets the backward stream to be the forward one, so both happen on the ambient stream that was current during the forward op call.
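To make this concrete, here is a minimal sketch (not taken from this PR; the shapes and names are arbitrary) of save_on_cpu used under a non-default stream, relying on the guarantee described above:

```python
import torch

assert torch.cuda.is_available()  # the sketch assumes a CUDA device
side_stream = torch.cuda.Stream()
x = torch.randn(512, 512, device="cuda", requires_grad=True)

with torch.autograd.graph.save_on_cpu(pin_memory=True):
    with torch.cuda.stream(side_stream):
        # The pack hook (offload to CPU) runs here, on `side_stream`.
        y = (x * x).sum()

# The autograd engine replays each node's backward on the stream its forward
# op ran on, so the unpack hook (onload back to the GPU) also runs on
# `side_stream`; no extra sync between the two copies is required.
y.backward()
torch.cuda.synchronize()
```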
Alright, thanks for the confirmation @albanD. I will get this PR mergeable when I have some downtime in the next few days.
What exactly do you plan on doing in the next iteration?
In particular, would we trade off reliability to enable performance (as I mentioned below)?
@@ -0,0 +1,42 @@
# Owner(s): ["module: cuda"]
Please move these checks to test_autograd; there is no need for a special file.
Unfortunately, I believe there is a need. CudaSyncGuard sets a global variable, not a thread-local one. If another test in test_autograd.py happens to run concurrently on a different thread and legitimately does a host synchronization, it will error out while the CudaSyncGuard context manager is active. Therefore, I need to run this test in its own process (check our run_test.py to see how that is enabled). Does that make sense?
I suppose my concern is moot if every test runs serially. Can I assume that this is the case? If so, I am okay making your proposed change.
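For reference, the thread-safety concern comes from the sync-debug mode being process-wide. A rough sketch of the mechanism, assuming CudaSyncGuard simply toggles torch.cuda.set_sync_debug_mode (which is how I read it):

```python
import torch

# "error" mode makes blocking host synchronizations raise. The setting is
# process-wide, so it also affects tests running on other threads.
torch.cuda.set_sync_debug_mode("error")
try:
    # e.g. .item() on a CUDA tensor performs a blocking device-to-host copy
    # and would now raise instead of silently synchronizing.
    torch.randn(4, device="cuda").sum().item()
except RuntimeError as e:
    print("caught expected synchronization error:", e)
finally:
    torch.cuda.set_sync_debug_mode("default")
```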
You can move it to the test/autograd/* folder then, and make sure to set the appropriate flag in run_test.py to avoid parallelism.
@ngimel I guess it is expected for CudaSyncGuard to be global and not thread-local?
Also maybe @serialTest() would be a way to force the single test to run serially
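If that route is taken, the test could look roughly like the sketch below. The import locations of serialTest and CudaSyncGuard are assumed (they may live elsewhere in the test utilities), so treat this as illustrative only:

```python
# Sketch only: assumes serialTest and CudaSyncGuard are importable from
# torch.testing._internal.common_utils.
import torch
from torch.testing._internal.common_utils import (
    TestCase,
    run_tests,
    serialTest,
    CudaSyncGuard,
)


class TestSaveOnCpuNoSync(TestCase):
    @serialTest()  # force this test to run serially, since the guard is global
    def test_save_on_cpu_does_not_sync(self):
        x = torch.randn(64, 64, device="cuda", requires_grad=True)
        # "error" mode makes any accidental host synchronization raise,
        # so the offload/onload copies must be truly non-blocking.
        with CudaSyncGuard("error"):
            with torch.autograd.graph.save_on_cpu(pin_memory=True):
                y = (x * x).sum()
            y.backward()


if __name__ == "__main__":
    run_tests()
```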
Yes, CudaSyncGuard is global
)
-    packed.copy_(tensor)
+    packed.copy_(tensor, non_blocking=actually_pin_memory)
IIRC, the reason we did it this way was to ensure that the CPU-side object is always correct, whichever way the user pokes at it. Since it is a showcase, we prioritized reliability over performance.
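For readers unfamiliar with the hazard being traded off here: with a non-blocking copy, the CPU buffer is not guaranteed to hold the data until the copy's stream has been synchronized. A small illustration (not from the PR):

```python
import torch

x = torch.randn(1 << 20, device="cuda")
packed = torch.empty(x.shape, dtype=x.dtype, device="cpu", pin_memory=True)
packed.copy_(x, non_blocking=True)

# Without this synchronization, reading `packed` on the CPU may observe
# stale data because the device-to-host copy can still be in flight.
torch.cuda.current_stream().synchronize()
assert torch.equal(packed, x.cpu())
```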
A coworker noticed while working with my draft PR #146924 that torch.autograd.graph.save_on_cpu currently performs a device-to-host synchronization after the memcpy every time a tensor is saved to CPU memory in the forward pass. This is obviously a mistake.

Unfortunately, testing is a bit difficult. We can't test for a host synchronization directly via CUDA graph capture, since CUDA stream capture currently does not support pin_memory() calls (my coworker @tingyangk hit this issue because he is working on #146924), so instead I use CudaSyncGuard. This itself is not perfect because it has a global effect, so in order to prevent concurrent test cases that synchronize from crashing, I run this test in its own process.
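Concretely, the saved-tensor pack hook ends up doing something like the sketch below (paraphrased, not the verbatim PyTorch source). The fix is the non_blocking argument on the staging copy, which is only safe because the destination buffer is pinned:

```python
import torch

def pack_to_cpu(tensor: torch.Tensor):
    # Paraphrased sketch of the save_on_cpu pack hook discussed in this PR.
    actually_pin_memory = torch.cuda.is_available() and not tensor.is_sparse
    packed = torch.empty(
        tensor.size(),
        dtype=tensor.dtype,
        layout=tensor.layout,
        pin_memory=actually_pin_memory,
    )
    # Before this PR: packed.copy_(tensor), a blocking copy that forced a
    # device-to-host synchronization for every saved tensor.
    packed.copy_(tensor, non_blocking=actually_pin_memory)
    return (tensor.device, packed)
```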
@Varal7 originally added this in 9beb279, but that was several years ago, and I am not sure he is still active and able to review this. I don't think this code is very actively used anyway, but I figured it was worthwhile to provide a fix.
cc @ezyang @albanD @gqchen @nikitaved @soulitzer @Varal7 @xmfan @ptrblck @msaroufim @eqy @jerryzh168