[FSDP] Reshard frozen params in backward #101982
Conversation
This PR makes a first attempt at improving FSDP's fine-tuning support by adding hooks to reshard frozen parameters in the backward pass.
- Without this, frozen parameters involved in gradient computation are kept unsharded through the entire backward pass.
- The approach is to register a multi-grad post-hook on the _input_ activations to the FSDP module, where the hook performs the resharding after all gradients for the FSDP module must have been computed (meaning that we are safe to reshard).

This PR relies on adding a "multi-grad post-hook" that differs from the existing "multi-grad hook" from `register_multi_grad_hook()`. I find that with `register_multi_grad_hook()`, sometimes the unit test counting the number of times `_post_backward_reshard()` is called fails (due to it not being called).
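To make the idea concrete, here is a minimal, standalone sketch of the general pattern using the public `torch.autograd.graph.register_multi_grad_hook` API. It is not the PR's implementation (which initially adds a separate "multi-grad post-hook"), and `_reshard_frozen_params` / `register_reshard_on_inputs` are hypothetical stand-ins for FSDP's internal reshard logic:

```python
import torch
import torch.nn as nn


def _reshard_frozen_params(module: nn.Module) -> None:
    # Hypothetical stand-in for FSDP's post-backward reshard: in real FSDP this
    # would free the module's unsharded flat parameter.
    print(f"resharding frozen params of {type(module).__name__}")


def register_reshard_on_inputs(module: nn.Module, inputs):
    # Hook only the inputs that participate in autograd. The multi-grad hook
    # fires once gradients for *all* hooked inputs have been computed, i.e.,
    # after backward has finished flowing through this module, so it is safe
    # to reshard the module's frozen parameters at that point.
    tensors = tuple(
        t for t in inputs if isinstance(t, torch.Tensor) and t.requires_grad
    )
    if not tensors:
        return None  # nothing to hook; a catch-all reshard must clean up instead
    return torch.autograd.graph.register_multi_grad_hook(
        tensors, lambda grads: _reshard_frozen_params(module)
    )


# Usage: a frozen layer whose input activation still requires grad.
frozen = nn.Linear(4, 4)
frozen.requires_grad_(False)

x = torch.randn(2, 4, requires_grad=True)
handle = register_reshard_on_inputs(frozen, (x,))
frozen(x).sum().backward()  # the reshard callback fires once x's gradient is ready
```

The key point is that gradients are never computed for the frozen parameters themselves, so the only reliable signal that backward is done with the module is that gradients for its inputs exist.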
My understanding might be wrong here, but frozen parameters being resharded late (I'm assuming they're resharded in `_catch_all_reshard`) is similar to how we reshard "unused parameters" in the catch-all reshard as well. Could we use this technique to completely eliminate the `_catch_all_reshard`?
@rohan-varma The catch-all reshard is still useful in the case that the very first input activation does not require gradient. Then, we have no activation on which we can reshard the root FSDP instance's parameters if they are frozen.
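For concreteness, a tiny standalone illustration of that corner case (an assumed example, not FSDP code): integer token ids can never require gradient, so there is no input tensor on which to register a reshard hook and the catch-all reshard still has to clean up:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)
emb.requires_grad_(False)  # frozen (root) module

token_ids = torch.randint(0, 10, (2, 3))  # integer inputs can never require grad
hookable = [t for t in (token_ids,) if t.requires_grad]
print(hookable)  # [] -> no multi-grad hook can be registered here, so a
                 # catch-all reshard is still needed to free unsharded params
```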
@awgu @rohan-varma Is this PR ready for merge or not? What is the blocker?
@pengyanghua I was waiting for a fix on the autograd side and have not had a chance to rebase it yet. I will do that soon.
awesome work!
```diff
 _p_assert(
-    len(flat_param._post_backward_hook_state) == 2,
+    post_backward_hook_state_len == 1
```
why can it be 1 or 2 now?
When the parameters are frozen (`requires_grad=False`), we do not have an `AccumulateGrad` object anymore, so the state looks like `handle.flat_param._post_backward_hook_state = hook_handle`. Normally, it is both the `hook_handle` and the `acc_grad`.
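For illustration, here is a self-contained sketch of a length check that accepts both layouts. `FakeFlatParam` and `check_post_backward_state` are hypothetical stand-ins rather than FSDP internals, and `_p_assert` is re-implemented locally:

```python
from dataclasses import dataclass


def _p_assert(cond: bool, msg: str) -> None:
    # Local re-implementation of an assert helper; not imported from FSDP.
    if not cond:
        raise AssertionError(msg)


@dataclass
class FakeFlatParam:  # hypothetical stand-in for FSDP's flat parameter
    requires_grad: bool
    _post_backward_hook_state: tuple


def check_post_backward_state(flat_param: FakeFlatParam) -> None:
    # (acc_grad, hook_handle) for trainable flat parameters; (hook_handle,) for frozen ones.
    expected_len = 2 if flat_param.requires_grad else 1
    _p_assert(
        len(flat_param._post_backward_hook_state) == expected_len,
        f"Invalid _post_backward_hook_state: {flat_param._post_backward_hook_state!r}",
    )


check_post_backward_state(FakeFlatParam(True, ("acc_grad", "hook_handle")))  # ok: len 2
check_post_backward_state(FakeFlatParam(False, ("hook_handle",)))            # ok: len 1
```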
```diff
@@ -1360,6 +1373,39 @@ def _register_post_backward_hooks(
         flat_param._post_backward_hook_state = (acc_grad, hook_handle)  # type: ignore[attr-defined]


 def _register_post_backward_reshard_only_hooks(
```
looks like now we have 2 paths where params can be resharded in post backward:
- for flat parameters that don't require grad, do it via a hook on the input activations
- for flat parameters that do require grad, do it with the standard post backward hook
could we unify and do both the reshards with the input activations hook?
Resharding on the input activation can be later than our existing post-backward hook, which may regress memory. We should keep both paths and view this newly added path as the fallback for the `requires_grad=False` case.
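To see why the input-activation hook can fire later, here is a minimal timing demo with plain autograd (assumptions: a recent PyTorch; `register_post_accumulate_grad_hook` is used only as a rough stand-in for the existing per-parameter post-backward hook, which is not how FSDP registers it internally):

```python
import torch
import torch.nn as nn

order = []

net = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
x = torch.randn(2, 4, requires_grad=True)

# Fires as soon as this parameter's gradient has been accumulated, which
# happens early in backward because the last layer is reached first.
net[1].weight.register_post_accumulate_grad_hook(
    lambda p: order.append("param grad ready")
)

# Fires only once the gradient for the module input exists, i.e., after
# backward has passed through the whole stack.
torch.autograd.graph.register_multi_grad_hook(
    (x,), lambda grads: order.append("input grad ready")
)

net(x).sum().backward()
print(order)  # ['param grad ready', 'input grad ready']
```

The parameter hook fires as soon as that parameter's gradient is ready, while the hook on the input only fires after backward has traversed the whole stack; resharding at that later point keeps the unsharded parameters alive longer.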
@pytorchbot merge
Merge failed. Reason: HTTP Error 403: rate limit exceeded.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
@awgu, thanks for the great work, but I just came across a problem that might be related.
Any idea? I am using the latest nightly version that I could find to support cuda117. That is
@jcyk Would it be possible to share a repro?
Stack from ghstack (oldest at bottom):

This PR makes a first attempt at improving FSDP's fine-tuning support by adding hooks to reshard frozen parameters in the backward pass.
- Without this, frozen parameters involved in gradient computation are kept unsharded through the entire backward pass.
- The approach is to register a multi-grad post-hook on the _input_ activations to the FSDP module, where the hook performs the resharding after all gradients for the FSDP module must have been computed (meaning that we are safe to reshard).

~~This PR relies on adding a "multi-grad post-hook" that differs from the existing "multi-grad hook" from `register_multi_grad_hook()`. I find that with `register_multi_grad_hook()`, sometimes the unit test counting the number of times `_post_backward_reshard()` is called fails (due to it not being called).~~ This was resolved in #102859.