
DDP+TP composition does not work as expected #157445

@volcacius

Description

I encountered a variety of issues while trying to adopt a combination of DistributedDataParallel and DTensor-based tensor parallelism. Some are specific to DDP+TP, others are more general.

This seems to be somewhat known, since e.g. torchtitan disallows combining DDP with other parallelization strategies (https://github.com/pytorch/torchtitan/blob/c08c9d4962ea843dd786b850d1716861955b9a9f/torchtitan/models/llama3/infra/parallelize.py#L123), but I couldn't find an explicit issue tracking it, hence I'm putting my findings here.

  1. The logic designed to support tensor parallelism within DDP does not operate as documented in https://github.com/pytorch/pytorch/commits/v2.7.0/torch/distributed/tensor/parallel/ddp.py . In principle it should:
    A) Convert from DTensor to local tensor at init time.
    B) Convert from local tensor to DTensor in the pre-forward hook.
    C) Convert from DTensor back to local tensor in the post-forward hook.
    In practice, once (B) fires for the first time, all parameters are replaced with non-leaf DTensors by _unflatten_tensor(tensor, spec, *, device_handle=None, compute_stream=None), which leaves the model parameter-less, and therefore (C) never fires because it iterates over module.named_parameters(), which is now empty.
    As a consequence, the model appears to be stateless when invoking state_dict(), since all parameters have been deleted, so neither the standard state APIs nor DCP work. (A minimal repro sketch is included below.)

Even if (C) were called correctly, it's not clear to me that allocating a new parameter on every forward would make sense w.r.t. the optimizer. Doing a swap_tensors is also not possible (last time I checked), since torch.Tensor and DTensor have different slots, so I think the original parameters created in (A) need to be preserved. On top of that, the tensor conversion hook would be called multiple times.

Note that while I'm talking specifically about DistributedDataParallel, replicate uses the same conversion hooks to compose with TP (see the "# TODO: This is a temporary work around to enable DDP + TP." comment).
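
For reference, a minimal repro sketch of the behaviour described in (1). This is my own, not from the report, and makes a few assumptions: two ranks on CPU with the gloo backend, a 1-D mesh shared by TP and DDP, and _pre_dp_module_transform from the linked ddp.py applied explicitly before wrapping in DDP. The model, plan, and shapes are purely illustrative.

```python
# Hypothetical repro sketch (mine, not from the report); launch with e.g.:
#   torchrun --nproc_per_node=2 repro_ddp_tp.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, parallelize_module
from torch.distributed.tensor.parallel.ddp import _pre_dp_module_transform
from torch.nn.parallel import DistributedDataParallel as DDP


def main() -> None:
    dist.init_process_group("gloo")  # assumption: CPU + gloo for simplicity
    # Degenerate setup: the same 2-rank group is used for both TP and DDP.
    # Not a meaningful 2D layout, but enough to trigger the hooks.
    mesh = init_device_mesh("cpu", (dist.get_world_size(),))

    model = nn.Sequential(nn.Linear(8, 8))
    parallelize_module(model, mesh, {"0": ColwiseParallel()})

    # (A): convert DTensor params to local tensors and register the (B)/(C) hooks.
    _pre_dp_module_transform(model)
    model = DDP(model)

    print("params before forward:", [n for n, _ in model.named_parameters()])
    model(torch.randn(4, 8))
    # After the first forward, (B) has swapped the parameters for non-leaf
    # DTensors and (C) never ran, so both of these come out empty.
    print("params after forward: ", [n for n, _ in model.named_parameters()])
    print("state_dict keys:      ", list(model.state_dict().keys()))

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```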

  2. Composition with activation checkpointing is also broken, since the two hooks are called once per model forward rather than once per submodule forward. So if activation checkpointing invokes a submodule (e.g. during recomputation), the DTensor conversion doesn't happen.
    Additionally, even in a scenario where the hooks operated per submodule, early stopping being enabled by default (_enable_checkpoint_early_stop = True) is also problematic, since it can lead to post-forward hooks being skipped (unless they are registered with always_call, which is not the case). To me this appears to be problematic in general for any hook-based implementation, not just TP+DDP. (The relevant knobs are sketched below.)
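
To make (2) concrete, here is a small sketch of mine (not from the report) of the two knobs mentioned above: the always_call=True argument of nn.Module.register_forward_hook, which runs the hook even if the forward exits via an exception (such as the early-stop exception raised during non-reentrant checkpoint recomputation), and torch.utils.checkpoint.set_checkpoint_early_stop(False), which disables early stopping around a checkpointed region. The hook body is a placeholder for the DTensor conversion.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint, set_checkpoint_early_stop

submodule = nn.Linear(8, 8)


def post_forward(module, args, output):
    # Placeholder for the DTensor -> local tensor conversion discussed above.
    print("post-forward hook ran on", type(module).__name__)


# A default registration would skip the hook whenever the (recomputed) forward
# exits early via an exception; always_call=True asks PyTorch to run it anyway.
submodule.register_forward_hook(post_forward, always_call=True)

x = torch.randn(4, 8, requires_grad=True)

# Alternatively, disable early stopping around the checkpointed forward so the
# recomputation always runs to completion.
with set_checkpoint_early_stop(False):
    out = checkpoint(submodule, x, use_reentrant=False)
out.sum().backward()
```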

  3. Propagation of requires_grad is fairly inconsistent. It's not propagated at init time by the conversion logic, nor by any of the TP parallel styles (e.g. where dist_param = nn.Parameter(...) is constructed), and it's also set in place unconditionally via tensor._local_tensor.requires_grad_() on the DTensor's local tensor, for reasons not obvious to me. (A quick check is sketched below.)
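
And a quick check for (3), again a sketch of mine under the same single-host gloo assumptions as above: if requires_grad is not propagated by the parallel styles, a weight frozen before parallelization would silently come back trainable.

```python
# Launch with e.g.: torchrun --nproc_per_node=2 check_requires_grad.py
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, parallelize_module

dist.init_process_group("gloo")
mesh = init_device_mesh("cpu", (dist.get_world_size(),))

linear = nn.Linear(8, 8)
linear.weight.requires_grad_(False)  # freeze the weight before parallelizing

parallelize_module(linear, mesh, ColwiseParallel())

# If requires_grad were propagated, this would still print False; per the
# report, the sharded parameter is rebuilt with the nn.Parameter default (True).
print(linear.weight.requires_grad)

dist.destroy_process_group()
```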

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

Metadata

Labels: oncall: distributed, triaged
