
Remove accidental host synchronization in autograd cpu offload #159698


Open
galv wants to merge 1 commit into main

Conversation

Collaborator

@galv galv commented Aug 2, 2025

A coworker noticed while working with my draft PR #146924 that torch.autograd.graph.save_on_cpu currently does a device-to-host synchronization after the memcpy every time a tensor is saved to CPU memory in the forward pass. This is obviously a mistake.
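For context, a minimal sketch of roughly what the save_on_cpu pack/unpack hooks do (simplified and approximate, not the exact upstream code; `actually_pin_memory` is the flag the real hook computes from `pin_memory` and device availability). The fix is the `non_blocking` argument on the copy: a blocking device-to-pinned-host `copy_` makes the host wait for the memcpy, which is the accidental synchronization.

```python
import torch

# Rough sketch of the save_on_cpu hooks (approximate, not the exact upstream code).
def pack_to_cpu(tensor, pin_memory=True):
    actually_pin_memory = pin_memory and torch.cuda.is_available()
    packed = torch.empty(
        tensor.size(),
        dtype=tensor.dtype,
        layout=tensor.layout,
        pin_memory=actually_pin_memory,
    )
    # Before this PR: packed.copy_(tensor) -- a blocking D2H copy, i.e. an
    # implicit host synchronization for every saved tensor in the forward pass.
    packed.copy_(tensor, non_blocking=actually_pin_memory)
    return (tensor.device, packed)

def unpack_from_cpu(saved, pin_memory=True):
    device, packed = saved
    # Copy back to the original device during backward; async when pinned.
    return packed.to(device, non_blocking=pin_memory)
```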

Unfortunately, testing is a bit difficult. We can't detect the host synchronization directly via CUDA graph capture, since CUDA stream capture currently does not support pin_memory() calls (my coworker @tingyangk hit this issue because he is working on #146924), so instead I use CudaSyncGuard. This itself is not perfect because it has a global effect. To keep concurrent test cases that legitimately synchronize from crashing, I run this test in its own process.
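The check is roughly of this shape (a sketch; the actual test uses the internal CudaSyncGuard helper rather than calling torch.cuda.set_sync_debug_mode directly, but the effect of erroring on any synchronizing CUDA call is the same):

```python
import torch
from torch.autograd.graph import save_on_cpu

# Sketch: fail if any synchronizing CUDA call happens while offloading
# saved tensors during the forward pass. Assumes a CUDA device is present.
x = torch.randn(256, 256, device="cuda", requires_grad=True)
torch.cuda.set_sync_debug_mode("error")  # raise on synchronizing CUDA calls
try:
    with save_on_cpu(pin_memory=True):
        y = (x * x).sum()  # pack hook copies saved tensors to pinned CPU memory
finally:
    torch.cuda.set_sync_debug_mode("default")
y.backward()  # onload back to the GPU happens here, outside the guarded region
```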

@Varal7 originally added this in 9beb279, but that was several years ago, and I am not sure that he is still active and able to review this. I don't think this code is used very actively anyway, but I figured it was worthwhile to provide a fix.

cc @ezyang @albanD @gqchen @nikitaved @soulitzer @Varal7 @xmfan @ptrblck @msaroufim @eqy @jerryzh168

@galv galv requested review from a team, albanD and soulitzer as code owners August 2, 2025 00:40

pytorch-bot bot commented Aug 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159698

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 3 Unrelated Failures

As of commit 67f5adf with merge base e4b123b:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@galv galv requested a review from ngimel August 2, 2025 00:40
Contributor

github-actions bot commented Aug 2, 2025

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

 )
-packed.copy_(tensor)
+packed.copy_(tensor, non_blocking=actually_pin_memory)
Collaborator


@albanD, @soulitzer what stream would this copy run on? Is it guaranteed that offloading and onloading will run on the same stream? Otherwise we do need a sync between offloading and onloading.

Collaborator Author


We can empirically check with CUPTI.

I ran the test case I made with cudaProfilerStart() and cudaProfilerStop() added just before and after the CudaSyncGuard context manager to remove noise.

report18.nsys-rep.zip

If you go to "stats system view", and then "CUDA GPU Trace", and then sort by the "Strm" column, you will see that all GPU operations run on stream 7.

However, your comment made me realize that this CPU memory offloading hook is honestly pretty poor. It uses the same stream as the rest of the computation. An ideal implementation would use a different stream for these copies, so that you don't pay a time cost in order to use less memory.

If I were to add this capability to use a separate stream to this existing API, though, it could break users who rely on device memory being immediately recycled (if anyone is even using this at all).

Both Megatron-LM and, I think, torchao have their own memory offloading context managers that handle this. I'm happy to give up on this PR, or to just merge this without using a separate stream for copying.
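For illustration of that idea (not something this PR does): an offload hook on a dedicated copy stream would look roughly like the sketch below. Everything here is hypothetical, and note that record_stream is exactly what delays the immediate recycling of device memory mentioned above.

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

# Hypothetical sketch only: offload saved tensors on a side stream so the
# D2H copies can overlap with compute. Not what save_on_cpu does today.
copy_stream = torch.cuda.Stream()

def pack(tensor):
    produce_stream = torch.cuda.current_stream()
    packed = torch.empty(tensor.size(), dtype=tensor.dtype, pin_memory=True)
    done = torch.cuda.Event()
    copy_stream.wait_stream(produce_stream)     # don't copy before the producer finished
    with torch.cuda.stream(copy_stream):
        packed.copy_(tensor, non_blocking=True)
        done.record()                           # records on copy_stream
    tensor.record_stream(copy_stream)           # keep the source block alive for the copy
    return tensor.device, packed, done

def unpack(saved):
    device, packed, done = saved
    torch.cuda.current_stream().wait_event(done)  # H2D must not start before the D2H finished
    return packed.to(device, non_blocking=True)

# Usage:
# with saved_tensors_hooks(pack, unpack):
#     loss = model(inputs).sum()
# loss.backward()
```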

Collaborator


Yeah, I suspected that this hook is not doing anything smart (and that's why torchao and Megatron need something better). But for workloads that use different streams for different autograd nodes, it might be running on the stream that the node ran on, or some other stream; I didn't dig deep enough into the implementation to know what happens.

Collaborator


Yes this hook is more of a simple showcase than anything extra fancy.

Is it guaranteed that offloading and onloading will run on the same stream?

Yes it is, as the autograd engine will set the backward stream to be the forward one. So both will happen on the ambient stream that was current during the forward op call.

Collaborator Author


Alright, thanks for the confirmation @albanD. I will get this PR mergeable when I have some downtime in the next few days.

Collaborator


What exactly do you plan on doing in the next iteration?
In particular, would we trade off reliability to enable performance (as I mentioned below)?

@galv galv added the module: autograd (Related to torch.autograd, and the autograd engine in general) and module: cuda (Related to torch.cuda, and CUDA support in general) labels Aug 3, 2025
@@ -0,0 +1,42 @@
# Owner(s): ["module: cuda"]
Collaborator


Please move these checks to test_autograd; no need for a special file.

Collaborator Author


Unfortunately, I believe there is a need. CudaSyncGuard sets a global variable, not a thread-local one. If another thread in test_autograd.py happens to be concurrently running a test where host synchronization is okay, that test will error out while the CudaSyncGuard context manager is active. Therefore, I need to run this test in its own process (check run_test.py to see how that is enabled). Does that make sense?

I suppose my concern is not really a concern if every test runs serially. Can I assume that this is the case? If so, I am okay making your proposed change.

Collaborator


You can move that to the test/autograd/* folder then, and make sure to set the appropriate flag in run_test.py to avoid parallelism.

@ngimel I guess it is expected for the CudaSyncGuard to be global and not thread local?

Collaborator


Also, maybe @serialTest() would be a way to force the single test to run serially.
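If that's enough, the test could live in test_autograd.py along these lines (a sketch only; it assumes serialTest, CudaSyncGuard, and TestCase are importable from torch.testing._internal.common_utils and that CudaSyncGuard accepts the same mode strings as torch.cuda.set_sync_debug_mode):

```python
import torch
from torch.autograd.graph import save_on_cpu
from torch.testing._internal.common_utils import CudaSyncGuard, TestCase, serialTest

class TestSaveOnCpuNoSync(TestCase):
    @serialTest()  # the sync debug mode is global, so don't run next to other tests
    def test_save_on_cpu_does_not_sync(self):
        x = torch.randn(64, 64, device="cuda", requires_grad=True)
        with CudaSyncGuard("error"):  # error out on any synchronizing CUDA call
            with save_on_cpu(pin_memory=True):
                y = (x * x).sum()
        y.backward()
```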

Collaborator


Yes, CudaSyncGuard is global


 )
-packed.copy_(tensor)
+packed.copy_(tensor, non_blocking=actually_pin_memory)
Collaborator


IIRC the reason we did it this way was to ensure that the CPU-side object is always correct, whichever way the user pokes at it. Since it is a showcase, we prioritized reliability over performance.

@janeyx99 janeyx99 added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Aug 4, 2025
Labels
module: autograd, module: cuda, open source, triaged