
Cannot configure dist_timeout when using device_mesh #119574

@mvpatel2000

Description

🐛 Describe the bug

When calling `dist.initialize_dist`, I can specify a `dist_timeout` argument.

When training with FSDP and a device_mesh, I want to call `from torch.distributed._tensor import init_device_mesh` and pass the resulting device_mesh into FSDP. However, the process groups created along the way do not appear to respect `dist_timeout`: each mesh dimension's subgroup is constructed with the default timeout via

dim_group = new_group(ranks=subgroup_ranks)
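
For context, a minimal sketch of the setup described above (the mesh shape, the 30-second timeout, the hybrid sharding strategy, and the toy model are illustrative assumptions, not values from this report):

```python
# Hypothetical repro sketch: the timeout passed at init time is not forwarded
# to the per-dimension subgroups that init_device_mesh creates via new_group().
from datetime import timedelta

import torch.distributed as dist
import torch.nn as nn
from torch.distributed._tensor import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Custom timeout set here (30 s is an arbitrary example value).
dist.init_process_group(backend="nccl", timeout=timedelta(seconds=30))

# Example 2-D mesh for hybrid sharding; the subgroups built internally for
# each mesh dimension use the default timeout rather than the one above.
mesh = init_device_mesh(
    "cuda",
    (2, dist.get_world_size() // 2),
    mesh_dim_names=("replicate", "shard"),
)

# Pass the mesh to FSDP; the toy nn.Linear stands in for a real model.
model = FSDP(
    nn.Linear(8, 8).cuda(),
    device_mesh=mesh,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```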

Versions

Torch nightly 1/10/2024

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @LucasLLC

Labels

module: DeviceMesh, oncall: distributed, triaged
