Labels
module: DeviceMesh, oncall: distributed, triaged
Description
🐛 Describe the bug
When calling `dist.initialize_dist`, I can specify a `dist_timeout` argument. When training with FSDP and a `device_mesh`, I want to call `from torch.distributed._tensor import init_device_mesh` and pass the resulting `device_mesh` into FSDP. However, it seems that the process groups created along this path do not respect `dist_timeout`; a minimal sketch of the workflow is below.
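A minimal sketch of what I mean (assuming a torchrun launch on 2 GPUs with the NCCL backend, and using plain `torch.distributed.init_process_group` in place of the `initialize_dist` wrapper mentioned above):

```python
# Minimal sketch; init_process_group stands in for the initialize_dist /
# dist_timeout wrapper referenced above.
from datetime import timedelta

import torch.distributed as dist
from torch.distributed._tensor import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# The timeout passed here applies to the default process group...
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=5))

# ...but the per-dimension subgroups that init_device_mesh builds internally
# via new_group(...) do not appear to pick it up.
device_mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# model = FSDP(model, device_mesh=device_mesh)
```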
The per-dimension process groups are created at this call, which does not forward a timeout:

pytorch/torch/distributed/device_mesh.py, line 292 in 5d6e323:
dim_group = new_group(ranks=subgroup_ranks)
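For reference, `torch.distributed.new_group` does accept a `timeout` keyword, so a rough sketch of what I would expect (purely hypothetical; `init_device_mesh` has no such argument today, and `user_timeout` is a placeholder for a caller-supplied value like `dist_timeout`) is:

```python
# Hypothetical only: forward a caller-supplied timeout when building each
# mesh-dimension subgroup inside device_mesh.py.
dim_group = new_group(ranks=subgroup_ranks, timeout=user_timeout)
```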
Versions
Torch nightly 1/10/2024
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @LucasLLC