Support of dtensor redistribute with device order #160266

zpcore · 2025-08-10T08:47:26Z

[Prototype; for RFC, not ready for review]

Redistributing a DTensor now honors the device ordering. If no order is specified, the default device order [0,1,2,...] is used. device_order can be specified as follows:

sharded_dt = distribute_tensor(input_data, mesh, placement, device_order)

and

out_dt = sharded_dt.redistribute(mesh, placement, device_order)

Note that the device order information is now stored in the DTensorSpec, so redistribute_local_tensor no longer needs the src_device_order and dst_device_order arguments. I have left them in for now as a reference for AutoParallel (cc @fmassa). I will remove those order-related arguments from the redistribute-related APIs in this PR.
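
To illustrate what honoring a device order could mean, here is a minimal pure-Python sketch. It assumes that device_order is the sequence in which mesh dimensions are applied when a single tensor dimension is sharded over several mesh dimensions; that reading is an interpretation and is not confirmed by this PR. The helper local_chunk is hypothetical, is not part of DTensor, and only handles even splits.

```python
def local_chunk(data, mesh_shape, coord, device_order=None):
    """Return the part of `data` owned by the device at mesh coordinate `coord`.

    Hypothetical helper (not a DTensor API). `data` is a 1-D sequence that is
    sharded on every mesh dimension; `device_order` lists the mesh dimensions
    in the order the splits are applied.
    """
    ndim = len(mesh_shape)
    if device_order is None:
        # Default order [0, 1, 2, ...], as stated in the description.
        device_order = list(range(ndim))
    if sorted(device_order) != list(range(ndim)):
        raise ValueError("device_order must be a permutation of the mesh dims")

    chunk = list(data)
    for mesh_dim in device_order:
        parts = mesh_shape[mesh_dim]
        if len(chunk) % parts != 0:
            raise ValueError("this sketch only handles even splits")
        size = len(chunk) // parts
        start = coord[mesh_dim] * size
        # Keep only the piece that belongs to this device along `mesh_dim`.
        chunk = chunk[start:start + size]
    return chunk


data = list(range(8))
mesh_shape = (2, 2)
coord = (0, 1)  # device at index 0 on mesh dim 0 and index 1 on mesh dim 1

print(local_chunk(data, mesh_shape, coord))          # default order -> [2, 3]
print(local_chunk(data, mesh_shape, coord, [1, 0]))  # reversed order -> [4, 5]
```

Under this assumption the same device holds different data depending on the order, which is why a redistribute would need to know both the source and the destination order.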

Stack from ghstack (oldest at bottom):

-> Support of dtensor redistribute with device order #160266

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

[ghstack-poisoned]

pytorch-bot · 2025-08-10T08:47:30Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160266

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 1 Unrelated Failure

As of commit 12fe67c with merge base 24257f5:

NEW FAILURES - The following jobs have failed:

pull / linux-jammy-cuda12.8-py3.10-gcc11 / test (distributed, 1, 3, lf.linux.g4dn.12xlarge.nvidia.gpu) (gh)
distributed/tensor/test_redistribute.py::RedistributeTest::test_redistribute_shard_dim_change_complex64
pull / linux-jammy-cuda12.8-py3.10-gcc11 / test (distributed, 3, 3, lf.linux.g4dn.12xlarge.nvidia.gpu) (gh)
distributed/tensor/test_random_ops.py::DistTensorRandomOpTest::test_deterministic_uniform_2d
pull / linux-jammy-py3.9-gcc11 / test (distributed, 1, 2, lf.linux.2xlarge) (gh)
distributed/tensor/test_redistribute.py::RedistributeTest::test_redistribute_shard_dim_change_complex64
pull / linux-jammy-py3.9-gcc11 / test (distributed, 2, 2, lf.linux.2xlarge) (gh)
distributed/tensor/test_utils.py::Test2DStridedLocalShard::test_fsdp1_tp_2d_dtensor_local_shards_and_offsets

FLAKY - The following job failed but was likely due to flakiness present on trunk:

inductor / unit-test / cuda12.8-py3.10-gcc9-sm86 / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu) (gh) (disabled by #153236)
distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py::TestClipGradNormWorldSize4::test_clip_grad_norm_2d

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: c5dd358 Pull Request resolved: #160266

fmassa · 2025-08-10T09:50:38Z

torch/distributed/tensor/_redistribute.py

+                    ):
+                        mesh_dim_size = device_mesh.size(mesh_dim=mesh_dim)
+                        current_placement = sorted_dst_placement[mesh_dim]
+                        assert isinstance(current_placement, Shard)


I might be doing something wrong when trying this out, but I'm hitting this assertion, where current_placement is Replicate.

Do you have instructions on how to reproduce the issue? I tried running python examples/example_autoparallel.py, but it failed with RuntimeError: Function CompiledFunctionBackward returned an invalid gradient at index 0 - got [1, 6144] but expected shape compatible with [24, 6144].

Support of dtensor redistribute with device order

e99b775

[ghstack-poisoned]

zpcore added a commit that referenced this pull request Aug 10, 2025

Support of dtensor redistribute with device order

52cbd46

ghstack-source-id: c5dd358 Pull Request resolved: #160266

pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Aug 10, 2025

fmassa reviewed Aug 10, 2025

zpcore added a commit that referenced this pull request Aug 10, 2025

Support of dtensor redistribute with device order

3296df1

ghstack-source-id: 0e27cda Pull Request resolved: #160266

zpcore added a commit that referenced this pull request Aug 10, 2025

Support of dtensor redistribute with device order

0c153f1

ghstack-source-id: ad23157 Pull Request resolved: #160266


zpcore Aug 11, 2025
