
Remove accidental host synchronization in autograd cpu offload #159698


Open
galv wants to merge 1 commit into main

Conversation

Collaborator

@galv galv commented Aug 2, 2025

A coworker noticed while working with my draft PR #146924 that torch.autograd.graph.save_on_cpu currently does a device-to-host synchronization after the memcpy every time a tensor is saved to CPU memory in the forward pass. This is obviously a mistake.
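For context, a minimal sketch of roughly what the save_on_cpu pack/unpack hooks do (simplified and approximate, not the exact upstream code; `actually_pin_memory` is the flag the real hook computes from `pin_memory` and device availability). The fix is the `non_blocking` argument on the copy: a blocking device-to-pinned-host `copy_` makes the host wait for the memcpy, which is the accidental synchronization.

```python
import torch

# Rough sketch of the save_on_cpu hooks (approximate, not the exact upstream code).
def pack_to_cpu(tensor, pin_memory=True):
    actually_pin_memory = pin_memory and torch.cuda.is_available()
    packed = torch.empty(
        tensor.size(),
        dtype=tensor.dtype,
        layout=tensor.layout,
        pin_memory=actually_pin_memory,
    )
    # Before this PR: packed.copy_(tensor) -- a blocking D2H copy, i.e. an
    # implicit host synchronization for every saved tensor in the forward pass.
    packed.copy_(tensor, non_blocking=actually_pin_memory)
    return (tensor.device, packed)

def unpack_from_cpu(saved, pin_memory=True):
    device, packed = saved
    # Copy back to the original device during backward; async when pinned.
    return packed.to(device, non_blocking=pin_memory)
```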

Unfortunately, testing is a bit difficult. We can't detect the host synchronization directly via CUDA graph capture, since CUDA stream capture currently does not support pin_memory() calls (my coworker @tingyangk hit this issue because he is working on #146924), so instead I use CudaSyncGuard. This itself is not perfect because it has a global effect. To keep concurrent test cases that legitimately synchronize from crashing, I run this test in its own process.
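The check is roughly of this shape (a sketch; the actual test uses the internal CudaSyncGuard helper rather than calling torch.cuda.set_sync_debug_mode directly, but the effect of erroring on any synchronizing CUDA call is the same):

```python
import torch
from torch.autograd.graph import save_on_cpu

# Sketch: fail if any synchronizing CUDA call happens while offloading
# saved tensors during the forward pass. Assumes a CUDA device is present.
x = torch.randn(256, 256, device="cuda", requires_grad=True)
torch.cuda.set_sync_debug_mode("error")  # raise on synchronizing CUDA calls
try:
    with save_on_cpu(pin_memory=True):
        y = (x * x).sum()  # pack hook copies saved tensors to pinned CPU memory
finally:
    torch.cuda.set_sync_debug_mode("default")
y.backward()  # onload back to the GPU happens here, outside the guarded region
```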

@Varal7 originally added this in 9beb279, but that was several years ago, and I am not sure that he is still active and able to review this. I don't think this code is used very actively anyway, but I figured it was worthwhile to provide a fix.

cc @ezyang @albanD @gqchen @nikitaved @soulitzer @Varal7 @xmfan @ptrblck @msaroufim @eqy @jerryzh168

@galv galv requested review from a team, albanD and soulitzer as code owners August 2, 2025 00:40

pytorch-bot bot commented Aug 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159698

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 3 Unrelated Failures

As of commit 67f5adf with merge base e4b123b:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@galv galv requested a review from ngimel August 2, 2025 00:40
Contributor

github-actions bot commented Aug 2, 2025

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

 )
-packed.copy_(tensor)
+packed.copy_(tensor, non_blocking=actually_pin_memory)
Collaborator


@albanD, @soulitzer what stream would this copy run on? Is it guaranteed that offloading and onloading will run on the same stream? Otherwise we do need a sync between offloading and onloading.

Collaborator Author


We can empirically check with CUPTI.

I ran the test case I made with cudaProfilerStart() and cudaProfilerStop() added just before and after the CudaSyncGuard context manager to remove noise.

report18.nsys-rep.zip

If you go to "stats system view", and then "CUDA GPU Trace", and then sort by the "Strm" column, you will see that all GPU operations run on stream 7.

However, your comment made me realize that this CPU memory offloading hook is honestly pretty poor. It uses the same stream as the rest of the computation. An ideal implementation would use a different stream for these copies, so that you don't pay a time cost in order to use less memory.

If I were to add this capability to use a separate stream to this existing API, though, it could break users who rely on device memory being immediately recycled (if anyone is even using this at all).

Both Megatron-LM and, I think, torchao have their own memory offloading context managers that handle this. I'm happy to give up on this PR, or to just merge this without using a separate stream for copying.
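For illustration of that idea (not something this PR does): an offload hook on a dedicated copy stream would look roughly like the sketch below. Everything here is hypothetical, and note that record_stream is exactly what delays the immediate recycling of device memory mentioned above.

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

# Hypothetical sketch only: offload saved tensors on a side stream so the
# D2H copies can overlap with compute. Not what save_on_cpu does today.
copy_stream = torch.cuda.Stream()

def pack(tensor):
    produce_stream = torch.cuda.current_stream()
    packed = torch.empty(tensor.size(), dtype=tensor.dtype, pin_memory=True)
    done = torch.cuda.Event()
    copy_stream.wait_stream(produce_stream)     # don't copy before the producer finished
    with torch.cuda.stream(copy_stream):
        packed.copy_(tensor, non_blocking=True)
        done.record()                           # records on copy_stream
    tensor.record_stream(copy_stream)           # keep the source block alive for the copy
    return tensor.device, packed, done

def unpack(saved):
    device, packed, done = saved
    torch.cuda.current_stream().wait_event(done)  # H2D must not start before the D2H finished
    return packed.to(device, non_blocking=True)

# Usage:
# with saved_tensors_hooks(pack, unpack):
#     loss = model(inputs).sum()
# loss.backward()
```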

Collaborator


Yeah, I suspected that this hook is not doing anything smart (and that's why torchao and Megatron need something better). But for workloads that use different streams for different autograd nodes, it might be running on the stream that the node ran on, or some other stream; I didn't dig deep enough into the implementation to know what happens.

Collaborator


Yes this hook is more of a simple showcase than anything extra fancy.

Is it guaranteed that offloading and onloading will run on the same stream?

Yes it is, as the autograd engine will set the backward stream to be the forward one. So both will happen on the ambient stream that was current during the forward op call.

Collaborator Author


Alright, thanks for the confirmation @albanD. I will get this PR mergeable when I have some downtime in the next few days.

Collaborator


What exactly do you plan on doing in the next iteration?
In particular, would we trade off reliability to enable performance (as I mentioned below)?

@galv galv added the module: autograd (Related to torch.autograd, and the autograd engine in general) and module: cuda (Related to torch.cuda, and CUDA support in general) labels Aug 3, 2025
@@ -0,0 +1,42 @@
# Owner(s): ["module: cuda"]
Collaborator


Please move these checks to test_autograd; no need for a special file.

Collaborator Author


Unfortunately, I believe there is a need. CudaSyncGuard sets a global variable, not a thread-local one. If another thread in test_autograd.py happens to be concurrently running a test where host synchronization is okay, that test will error out while the CudaSyncGuard context manager is active. Therefore, I need to run this test in its own process (check run_test.py to see how that is enabled). Does that make sense?

I suppose my concern is not really a concern if every test runs serially. Can I assume that this is the case? If so, I am okay making your proposed change.

Collaborator


You can move that to the test/autograd/* folder then, and make sure to set the appropriate flag in run_test.py to avoid parallelism.

@ngimel I guess it is expected for the CudaSyncGuard to be global and not thread local?

Collaborator


Also, maybe @serialTest() would be a way to force the single test to run serially.
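If that's enough, the test could live in test_autograd.py along these lines (a sketch only; it assumes serialTest, CudaSyncGuard, and TestCase are importable from torch.testing._internal.common_utils and that CudaSyncGuard accepts the same mode strings as torch.cuda.set_sync_debug_mode):

```python
import torch
from torch.autograd.graph import save_on_cpu
from torch.testing._internal.common_utils import CudaSyncGuard, TestCase, serialTest

class TestSaveOnCpuNoSync(TestCase):
    @serialTest()  # the sync debug mode is global, so don't run next to other tests
    def test_save_on_cpu_does_not_sync(self):
        x = torch.randn(64, 64, device="cuda", requires_grad=True)
        with CudaSyncGuard("error"):  # error out on any synchronizing CUDA call
            with save_on_cpu(pin_memory=True):
                y = (x * x).sum()
        y.backward()
```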

Collaborator


Yes, CudaSyncGuard is global


 )
-packed.copy_(tensor)
+packed.copy_(tensor, non_blocking=actually_pin_memory)
Collaborator


IIRC the reason we did it this way was to ensure that the CPU-side object is always correct, whichever way the user pokes at it. Since it is a showcase, we prioritized reliability over performance.

@janeyx99 janeyx99 added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Aug 4, 2025
Labels
module: autograd, module: cuda, open source, triaged