
Commit 016aac8

Merge pull request #985 from pytorch/holly1238-patch-1
Update 2022-3-14-introducing-pytorch-fully-sharded-data-parallel-api.md
2 parents 921db12 + 8f0c67d commit 016aac8

File tree

1 file changed: +1 −1 lines changed


_posts/2022-3-14-introducing-pytorch-fully-sharded-data-parallel-api.md (+1 −1)
@@ -41,7 +41,7 @@ There are two ways to wrap a model with PyTorch FSDP. Auto wrapping is a drop-in
 
 Model layers should be wrapped in FSDP in a nested way to save peak memory and enable communication and computation overlapping. The simplest way to do it is auto wrapping, which can serve as a drop-in replacement for DDP without changing the rest of the code.
 
-The fsdp_auto_wrap_policy argument allows specifying a callable function to recursively wrap layers with FSDP. The default_auto_wrap_policy function provided by PyTorch FSDP recursively wraps layers with more than 100M parameters. You can supply your own wrapping policy as needed; an example of writing a customized wrapping policy is shown in the [FSDP API doc](https://docs-preview.pytorch.org/72084/fsdp.html?highlight=fsdp#module-torch.distributed.fsdp).
+The fsdp_auto_wrap_policy argument allows specifying a callable function to recursively wrap layers with FSDP. The default_auto_wrap_policy function provided by PyTorch FSDP recursively wraps layers with more than 100M parameters. You can supply your own wrapping policy as needed; an example of writing a customized wrapping policy is shown in the [FSDP API doc](https://pytorch.org/docs/stable/fsdp.html).
 
 In addition, cpu_offload could be configured optionally to offload wrapped parameters to CPUs when these parameters are not used in computation. This can further improve memory efficiency at the cost of data transfer overhead between host and device.
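To make the auto-wrapping paragraph touched by this change concrete, here is a minimal sketch assuming the 1.11-era argument names the post uses (fsdp_auto_wrap_policy, default_auto_wrap_policy); the toy model, the functools.partial wrapper, the min_num_params keyword, and the 1M threshold are illustrative assumptions rather than anything taken from this diff.

```python
import functools

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import default_auto_wrap_policy

# Assumes a torch.distributed process group is already initialized,
# e.g. the script was launched with torchrun and init_process_group() was called.


def make_model() -> nn.Module:
    # Toy model standing in for a large network (illustrative only).
    return nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 16), nn.Linear(16, 4))


# Default policy: recursively wrap submodules with more than 100M parameters.
fsdp_model = FSDP(make_model(), fsdp_auto_wrap_policy=default_auto_wrap_policy)

# A customized policy is just a callable; here, a size-based variant with a
# lower, illustrative threshold of 1M parameters.
my_auto_wrap_policy = functools.partial(default_auto_wrap_policy, min_num_params=int(1e6))
fsdp_model = FSDP(make_model(), fsdp_auto_wrap_policy=my_auto_wrap_policy)
```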

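The cpu_offload sentence at the end of the hunk can be sketched the same way; CPUOffload(offload_params=True) is assumed to be the configuration the post has in mind, and the rest of the call mirrors the example above.

```python
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import default_auto_wrap_policy

# Assumes the process group is already initialized (e.g. via torchrun).
model = nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 16), nn.Linear(16, 4))

# offload_params=True keeps wrapped parameters in host memory while they are not
# participating in computation, trading host/device transfer overhead for lower
# GPU memory usage.
fsdp_model = FSDP(
    model,
    fsdp_auto_wrap_policy=default_auto_wrap_policy,
    cpu_offload=CPUOffload(offload_params=True),
)
```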