[DTensor] add op support: aten.squeeze_.dim #159532

Draft · wants to merge 2 commits into base: gh/XilunWu/161/base

Conversation

XilunWu (Contributor) commented Jul 30, 2025

Stack from ghstack (oldest at bottom):

Summary
This PR enables the in-place op aten.squeeze_.dim on DTensor with a change to
the DTensor dispatch logic: when processing an in-place operator, we should assign
output_sharding.output_spec back to the first argument, because the in-place
op_call on arg._local_tensor may also change its tensor meta.

Test
pytest test/distributed/tensor/test_view_ops.py -s -k test_squeeze_
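
For illustration (not part of the PR's test plan), a minimal sketch of the behavior this enables; the mesh/placement setup below is assumed, and it requires an already-initialized 4-rank process group:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Shard

# assumes torch.distributed is already initialized with 4 ranks
mesh = init_device_mesh("cuda", (4,))

x = torch.randn(1, 8, device="cuda")
dx = DTensor.from_local(x, mesh, [Shard(1)])  # global shape (1, 32)

dx.squeeze_(0)  # in-place aten.squeeze_.dim: the local tensor and the DTensor spec/meta must both update
assert dx.ndim == 1
```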

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @tianyu-l

pytorch-bot bot commented Jul 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159532

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 4 New Failures, 1 Unrelated Failure

As of commit 0477c8f with merge base ddbdcdc:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

  • pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)
    /var/lib/jenkins/workspace/xla/torch_xla/csrc/runtime/BUILD:476:14: Compiling torch_xla/csrc/runtime/xla_util_test.cpp failed: (Exit 1): gcc failed: error executing CppCompile command (from target //torch_xla/csrc/runtime:xla_util_test) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 229 arguments skipped)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

XilunWu added a commit that referenced this pull request Jul 30, 2025
ghstack-source-id: b4a560c
Pull Request resolved: #159532
pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Jul 30, 2025
# in-place op: propagate the computed sharding back onto the first argument,
# since the local in-place op_call may have already changed its tensor meta
output_spec = output_sharding.output_spec
assert isinstance(output_spec, DTensorSpec)
assert isinstance(args[0], DTensor)
args[0]._spec = output_spec
XilunWu (Contributor, Author) commented Jul 31, 2025:

@tianyu-l pointed out that, besides DTensor._spec.tensor_meta, the subclass's metadata also needs to be overridden (but that does not seem doable...)

XilunWu (Contributor, Author) commented:

Edward suggests trying return_and_correct_aliasing to fix the outer tensor meta; will try.
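
For context, a hedged sketch of what that suggestion could look like inside DTensor.__torch_dispatch__; the dispatcher call mirrors the current structure, while wrapping it in return_and_correct_aliasing is the suggestion being explored here, not something this PR already does, and the snippet is a method sketch rather than standalone code:

```python
from torch.utils._python_dispatch import return_and_correct_aliasing

@classmethod
def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
    kwargs = kwargs or {}
    # run sharding propagation and the local op as before
    out = DTensor._op_dispatcher.dispatch(func, args, kwargs)
    # let the helper reconcile the outer wrapper's aliasing/metadata with the
    # result, which is what in-place ops like aten.squeeze_.dim need
    return return_and_correct_aliasing(func, args, kwargs, out)
```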

XilunWu (Contributor, Author) commented:

confirmed that #158954 fixes the outer and inner aliasing mismatch issue. cc @ezyang

XilunWu added the topic: not user facing and module: dtensor labels Jul 31, 2025
x = torch.randn((1, 4), device=self.device_type)
# global dim 0 has size 1, so the in-place squeeze on dim 0 is valid
dist_x = DTensor.from_local(x, mesh_2d, [Partial(), Shard(1)])
self._test_op_on_dtensor(
    torch.ops.aten.squeeze_.dim,
A reviewer (Member) commented:

Should we also check if dist_x is changed or not?
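
A possible shape of that check, as a sketch (the exact expected placements depend on the test's mesh setup, so the assertions below are assumptions):

```python
# verify the in-place op also updated dist_x itself, not just the returned value
self.assertEqual(dist_x.ndim, 1)                    # the size-1 dim is gone globally
self.assertEqual(dist_x._local_tensor.ndim, 1)      # and on the local shard
self.assertEqual(dist_x.placements, (Partial(), Shard(0)))  # Shard(1) shifts to Shard(0)
```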

XilunWu (Contributor, Author) commented Aug 5, 2025

need to debug why "distributed/tensor/test_math_ops.py::DistMathOpsTest::test_rms_norm_bwd" is broken by this change

XilunWu added a commit that referenced this pull request Aug 5, 2025
ghstack-source-id: fe05405
Pull Request resolved: #159532
XilunWu marked this pull request as draft on August 5, 2025, 10:14
AaronWang04 (Contributor) commented Aug 7, 2025

@XilunWu this is the table dump of the forward pass. As you said, it is missing a collective and all the values are off :/

Maybe something with how rsqrt uses squeeze_.dim? Not sure right now; I can investigate further if needed. (In the main branch, rsqrt's input is still Partial(avg) and an all_reduce is issued before it; with this PR the input is already marked Replicate(), so that all_reduce is skipped.)

Interesting part of the dump (note the Partial vs Replicate sharding of rsqrt's input):
main branch

        **aten.mean.dim
        **aten.add_.Scalar
          shape: [torch.Size([1, 1, 1])]
          sharding: [(Partial(avg),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **_c10d_functional.all_reduce.default
        **_c10d_functional.wait_tensor.default
        **aten.add_.Scalar
        **aten.rsqrt.default
          shape: [torch.Size([1, 1, 1])]
          sharding: [(Partial(avg),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **_c10d_functional.all_reduce.default
        **_c10d_functional.wait_tensor.default
        **aten.rsqrt.default

this PR

        **aten.mean.dim
        **aten.add_.Scalar
          shape: [torch.Size([1, 1, 1])]
          sharding: [(Partial(avg),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **_c10d_functional.all_reduce.default
        **_c10d_functional.wait_tensor.default
        **aten.add_.Scalar
        **aten.rsqrt.default
          shape: [torch.Size([1, 1, 1])]
          sharding: [(Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.rsqrt.default

Full dump of forward pass
main branch

RMSNorm
    *module type: class 'torch.nn.modules.normalization.RMSNorm'
    *Parameter List
     *weight: (Replicate(),)
      FORWARD PASS
        *c10d_functional.all_reduce: 2
        **aten.view.default
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.view.default
        **aten.detach.default
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.detach.default
        **aten.detach.default
        **aten.detach.default
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.detach.default
        **aten.detach.default
        **aten.detach.default
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.detach.default
        **aten.detach.default
        **aten.detach.default
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.detach.default
        **aten.detach.default
        **aten.view.default
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.view.default
        **aten._fused_rms_norm.default
          shape: [torch.Size([20, 5, 10]), torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),), (Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.to.dtype
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.to.dtype
        **aten.pow.Tensor_Scalar
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.pow.Tensor_Scalar
        **aten.mean.dim
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.mean.dim
        **aten.add_.Scalar
          shape: [torch.Size([1, 1, 1])]
          sharding: [(Partial(avg),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **_c10d_functional.all_reduce.default
        **_c10d_functional.wait_tensor.default
        **aten.add_.Scalar
        **aten.rsqrt.default
          shape: [torch.Size([1, 1, 1])]
          sharding: [(Partial(avg),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **_c10d_functional.all_reduce.default
        **_c10d_functional.wait_tensor.default
        **aten.rsqrt.default
        **aten.mul.Tensor
          shape: [torch.Size([20, 5, 10]), torch.Size([1, 1, 1])]
          sharding: [(Shard(dim=0),), (Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.mul.Tensor
        **aten.mul.Tensor
          shape: [torch.Size([20, 5, 10]), torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),), (Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.chunk.default
        **aten.clone.default
        **aten.mul.Tensor
        **aten.type_as.default
          shape: [torch.Size([20, 5, 10]), torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),), (Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.to.dtype_layout
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.detach.default
          shape: [torch.Size([1, 1, 1])]
          sharding: [(Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.detach.default
        **aten.detach.default

this PR

RMSNorm
    *module type: class 'torch.nn.modules.normalization.RMSNorm'
    *Parameter List
     *weight: (Replicate(),)
      FORWARD PASS
        *c10d_functional.all_reduce: 1
        **aten.view.default
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.view.default
        **aten.detach.default
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.detach.default
        **aten.detach.default
        **aten.detach.default
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.detach.default
        **aten.detach.default
        **aten.detach.default
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.detach.default
        **aten.detach.default
        **aten.detach.default
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.detach.default
        **aten.detach.default
        **aten.view.default
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.view.default
        **aten._fused_rms_norm.default
          shape: [torch.Size([20, 5, 10]), torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),), (Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.to.dtype
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.to.dtype
        **aten.pow.Tensor_Scalar
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.pow.Tensor_Scalar
        **aten.mean.dim
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.mean.dim
        **aten.add_.Scalar
          shape: [torch.Size([1, 1, 1])]
          sharding: [(Partial(avg),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **_c10d_functional.all_reduce.default
        **_c10d_functional.wait_tensor.default
        **aten.add_.Scalar
        **aten.rsqrt.default
          shape: [torch.Size([1, 1, 1])]
          sharding: [(Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.rsqrt.default
        **aten.mul.Tensor
          shape: [torch.Size([20, 5, 10]), torch.Size([1, 1, 1])]
          sharding: [(Shard(dim=0),), (Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.mul.Tensor
        **aten.mul.Tensor
          shape: [torch.Size([20, 5, 10]), torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),), (Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.chunk.default
        **aten.clone.default
        **aten.mul.Tensor
        **aten.type_as.default
          shape: [torch.Size([20, 5, 10]), torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),), (Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.to.dtype_layout
          shape: [torch.Size([20, 5, 10])]
          sharding: [(Shard(dim=0),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.detach.default
          shape: [torch.Size([1, 1, 1])]
          sharding: [(Replicate(),)]
          device mesh: DeviceMesh((4,), device: 'cuda', stride: (1,))
        **aten.detach.default
        **aten.detach.default

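For reference, a hedged sketch of how a trace like the above might be reproduced; the use of CommDebugMode and its table method is an assumption about how this dump was generated, and the model/mesh setup is illustrative:

```python
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard, distribute_tensor
from torch.distributed.tensor.debug import CommDebugMode

# assumes torch.distributed is already initialized with 4 ranks
mesh = init_device_mesh("cuda", (4,))

norm = nn.RMSNorm(10, device="cuda")
# replicate the weight, shard the input on dim 0 (matches the shardings in the dump)
norm.weight = nn.Parameter(distribute_tensor(norm.weight.detach(), mesh, [Replicate()]))
x = distribute_tensor(torch.randn(20, 5, 10, device="cuda"), mesh, [Shard(0)])

comm_mode = CommDebugMode()
with comm_mode:
    out = norm(x)

# prints a per-module table of ops, shardings, and collectives like the one above
print(comm_mode.generate_comm_debug_tracing_table(noise_level=3))
```
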
Labels: ciflow/inductor, module: dtensor, oncall: distributed, topic: not user facing