[FSDP2] cast `unsharded_param_grad` to correct reduce dtype #160279

tonyf · 2025-08-10T21:36:14Z

This fixes an edge case that arises when using gradient accumulation with FSDP2.

If a parameter in a parameter group receives no gradient on any backward pass except the final one, its grad is not cast to the reduce dtype, so the gradients passed to reduce_scatter end up with inconsistent dtypes.

Specifically, when using a `MixedPrecisionPolicy` with `param_dtype=torch.bfloat16` and `reduce_dtype=torch.float32`, parameters that had grads on both steps have gradients of dtype `torch.float32`, but those that get a gradient only on the final pass have gradients of dtype `torch.bfloat16`. This raises:

AssertionError: FSDP reduce-scatter expects uniform gradient dtype but got {torch.bfloat16, torch.float32}

This PR explicitly casts the grad to the reduce dtype in that case.
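To make the failure mode concrete, here is a minimal sketch in plain PyTorch tensors. It is a toy model of the accumulation logic described above, not the actual FSDP2 code: `simulate_accumulation`, `check_uniform_dtype`, and the `cast_final_grad` flag are hypothetical names invented for illustration, and the assumption is that gradients from non-final passes are upcast to the reduce dtype while a first gradient arriving on the final pass is left in the param dtype.

```python
import torch

PARAM_DTYPE = torch.bfloat16   # param_dtype from the description
REDUCE_DTYPE = torch.float32   # reduce_dtype from the description


def check_uniform_dtype(grads):
    """Hypothetical stand-in for the uniform-dtype check before reduce-scatter."""
    dtypes = {g.dtype for g in grads}
    if len(dtypes) != 1:
        raise AssertionError(
            f"reduce-scatter expects uniform gradient dtype but got {dtypes}"
        )
    return next(iter(dtypes))


def simulate_accumulation(has_grad_per_step, cast_final_grad):
    """Toy model of gradient accumulation for one parameter group.

    has_grad_per_step[p][s] says whether parameter p gets a gradient on step s.
    Returns the gradients that would be handed to reduce-scatter.
    """
    num_steps = len(has_grad_per_step[0])
    accumulated = [None] * len(has_grad_per_step)
    for step in range(num_steps):
        is_last = step == num_steps - 1
        for p, has_grad in enumerate(has_grad_per_step):
            if not has_grad[step]:
                continue
            grad = torch.ones(4, dtype=PARAM_DTYPE)  # backward yields bf16
            if accumulated[p] is not None:
                # Add into the existing float32 accumulation buffer.
                accumulated[p] += grad.to(REDUCE_DTYPE)
            elif not is_last or cast_final_grad:
                # Non-final steps upcast; with the fix, the final step does too.
                accumulated[p] = grad.to(REDUCE_DTYPE)
            else:
                # Edge case: first gradient arrives on the final step, stays bf16.
                accumulated[p] = grad
    return [g for g in accumulated if g is not None]


# Parameter 0 gets a grad on both steps; parameter 1 only on the final step.
schedule = [[True, True], [False, True]]

try:
    check_uniform_dtype(simulate_accumulation(schedule, cast_final_grad=False))
except AssertionError as err:
    print("without the cast:", err)

print("with the cast:", check_uniform_dtype(
    simulate_accumulation(schedule, cast_final_grad=True)
))
```

Without the cast, the two gradients reach the check as one `torch.float32` tensor and one `torch.bfloat16` tensor, which is the mixed set reported in the error above. With the cast applied on the final step, both are `torch.float32`.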

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

pytorch-bot · 2025-08-10T21:36:17Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160279

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 12 Pending, 2 Unrelated Failures

As of commit 42b057a with merge base 05c19d1:

NEW FAILURE - The following job has failed:

Lint / lintrunner-noclang / linux-job (gh)
>>> Lint for torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

inductor / cuda12.8-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
Process completed with exit code 1.
inductor / cuda12.8-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2025-08-10T21:36:20Z

✅ login: tonyf (42b057a)

The committers listed above are authorized under a signed CLA.

cast unsharded param to reduce dtype

42b057a

pytorch-bot bot added the `oncall: distributed` and `release notes: distributed (fsdp)` labels Aug 10, 2025

pytorchbot added the open source label Aug 10, 2025

tonyf changed the title ~~[FSDP2] cast unsharded param to reduce dtype~~ [FSDP2] cast unsharded_param_grad to reduce dtype Aug 10, 2025

pytorch-bot bot added the ciflow/inductor label Aug 10, 2025

tonyf changed the title ~~[FSDP2] cast unsharded_param_grad to reduce dtype~~ [FSDP2] cast unsharded_param_grad to correct reduce dtype Aug 10, 2025
