
Support XPU in --nproc-per-node option to torchrun #159474


Open · wants to merge 19 commits into base: main

Conversation

@moksiuc (Contributor) commented Jul 30, 2025

Support both --nproc-per-node=xpu and autodetection of XPU device in case of --nproc-per-node=auto

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta


pytorch-bot bot commented Jul 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159474

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 36a5a47 with merge base 1c2cba1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed label Jul 30, 2025
@moksiuc (Contributor, Author) commented Jul 30, 2025

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing label Jul 30, 2025
@albanD albanD requested a review from d4l3k July 30, 2025 13:33
@albanD albanD added the triaged label Jul 30, 2025
@guangyey guangyey added the ciflow/xpu label Jul 31, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu label Jul 31, 2025
@guangyey guangyey added the ciflow/xpu and ciflow/trunk labels Jul 31, 2025
pytorch-bot bot commented Jul 31, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk label Jul 31, 2025
@guangyey guangyey added the ciflow/trunk label Jul 31, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/trunk and ciflow/xpu labels Jul 31, 2025
@moksiuc (Contributor, Author) commented Jul 31, 2025

I had to make an additional modification due to a lint error:

Lint for torch/distributed/run.py:

Error (MYPY) [union-attr]
Item "None" of "device | None" has no attribute "type"

    709  |        elif nproc_per_node == "auto":
    710  |            if torch.accelerator.is_available():
    711  |                num_proc = torch.accelerator.device_count()
>>> 712  |                device_type = torch.accelerator.current_accelerator().type
    713  |            else:
    714  |                num_proc = os.cpu_count()
    715  |                device_type = "cpu"
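The usual fix for this [union-attr] error is to guard the Optional return before touching .type. A minimal stdlib-only sketch of that pattern — the Device class and current_accelerator stub below are hypothetical stand-ins for torch's API, which is not imported here:

```python
from typing import Optional


class Device:
    """Minimal stand-in for torch.device (hypothetical, for illustration)."""

    def __init__(self, type: str) -> None:
        self.type = type


def current_accelerator() -> Optional[Device]:
    # Stand-in for torch.accelerator.current_accelerator(), which is
    # typed as returning an Optional device and may return None.
    return Device("xpu")


# Accessing .type directly on the Optional return is what trips mypy's
# [union-attr] check; guarding with a CPU fallback satisfies the checker:
acc = current_accelerator()
device_type = acc.type if acc is not None else "cpu"
```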

@guangyey previously approved these changes Aug 1, 2025
@guangyey guangyey added the ciflow/xpu label Aug 1, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu label Aug 1, 2025
@guangyey guangyey added the ciflow/xpu label Aug 2, 2025
@EikanWang (Collaborator)

@moksiuc, could you help check the failure related to distributed/launcher/test_run.py::ElasticLaunchTest::test_nproc_gpu_launch_configurations? The other two failures appear unrelated to this PR.

@moksiuc (Contributor, Author) commented Aug 4, 2025

@EikanWang
From the source of the test

self.assertSetEqual(
    {str(i) for i in range(world_size)}, set(os.listdir(self.test_dir))
)

and error message

AssertionError: Items in the second set but not the first:
'3'
'7'
'6'
'5'
'4'

From the logs, it looks like the test ran 8 workers while it expected only 3.
I don't know the test machine's configuration.
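For context, a stdlib-only sketch of what that assertion compares, assuming (per the snippet above) that each worker writes a file named after its rank into the test directory:

```python
import os
import tempfile

world_size = 3  # what the test expects
test_dir = tempfile.mkdtemp()

# Simulate 8 workers each writing a rank-named file, as appears to have
# happened on the CI machine:
for rank in range(8):
    open(os.path.join(test_dir, str(rank)), "w").close()

expected = {str(i) for i in range(world_size)}
actual = set(os.listdir(test_dir))
# The ranks present on disk but not expected by the test — these are
# exactly the items assertSetEqual reports as "in the second set but
# not the first":
extra = sorted(actual - expected)
```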

@pytorch-bot pytorch-bot bot removed the ciflow/xpu label Aug 4, 2025
@@ -413,7 +415,7 @@ def get_args_parser() -> ArgumentParser:
action=env,
type=str,
default="1",
help="Number of workers per node; supported values: [auto, cpu, gpu, int].",
help="Number of workers per node; supported values: [auto, cpu, gpu, xpu, int].",
Collaborator

The gpu term here means CUDA. It would be better to call it out to avoid confusing users.

Contributor Author

What do you mean? Change it to cuda or remove it completely?

@moksiuc (Contributor, Author) commented Aug 5, 2025

@guangyey
Debugging the CI failures shows the following results:
nproc_per_node=auto: device_type = 'cpu', num_proc = 8, os.cpu_count() = 8
torch.cuda.device_count() = 3
So, torch.accelerator does not include cuda. I'll have to leave the cuda branch in 'auto' mode.

Debug code used:
a29dba1
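A hedged sketch of the resulting branch order in 'auto' mode as described in this thread — a SimpleNamespace stub stands in for torch (the real module is not imported), with values mirroring the CI debug output above, where only torch.cuda reports devices:

```python
import os
import types

# Stub of the torch surface used below; on the CI machine in question,
# torch.accelerator reported nothing while torch.cuda reported 3 devices.
torch = types.SimpleNamespace(
    accelerator=types.SimpleNamespace(
        is_available=lambda: False,
        device_count=lambda: 0,
        current_accelerator=lambda: None,
    ),
    cuda=types.SimpleNamespace(
        is_available=lambda: True,
        device_count=lambda: 3,
    ),
)

# 'auto' branch order: prefer torch.accelerator, keep a separate cuda
# check as a fallback, and finally fall back to CPU:
if torch.accelerator.is_available():
    num_proc = torch.accelerator.device_count()
    device_type = torch.accelerator.current_accelerator().type
elif torch.cuda.is_available():
    num_proc = torch.cuda.device_count()
    device_type = "gpu"
else:
    num_proc = os.cpu_count()
    device_type = "cpu"
```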

@guangyey (Collaborator) left a comment

I can't believe that torch.accelerator doesn't include cuda.
I think the root cause is the UT itself.

@skip_but_pass_in_sandcastle_if(
    TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan"
)
@patch("torch.cuda.is_available", return_value=True)
@patch("torch.cuda.device_count", return_value=3)
def test_nproc_gpu_launch_configurations(self, _mock1, _mock2):
    self._test_nproc_launch_configuration("auto", 3)
    self._test_nproc_launch_configuration("gpu", 3)

This UT is patched so that torch.cuda.device_count always returns 3, even though it runs on a CPU-only machine.

@guangyey (Collaborator) commented Aug 5, 2025

I think we need to update that UT with some extra patches, such as
@patch("torch.accelerator.is_available", return_value=True)
@patch("torch.accelerator.device_count", return_value=3)
@patch("torch.accelerator.current_accelerator", return_value=torch.device("cuda"))
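To illustrate how such patches behave, a stdlib-only sketch using unittest.mock's patch.object on a stand-in accel namespace — the real patch targets would be the torch.accelerator functions named above, which are assumed, not imported:

```python
import types
from unittest.mock import patch

# Stand-in for torch.accelerator on a CPU-only machine:
accel = types.SimpleNamespace(
    is_available=lambda: False,
    device_count=lambda: 0,
)

# Inside the patch context, the machine "has" 3 accelerator devices;
# the patched attributes are MagicMocks with fixed return values:
with patch.object(accel, "is_available", return_value=True), \
     patch.object(accel, "device_count", return_value=3):
    patched = (accel.is_available(), accel.device_count())

# Outside the context, the original behavior is restored automatically:
restored = (accel.is_available(), accel.device_count())
```

The decorator form used in the UT works the same way, scoped to each test method.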

@guangyey guangyey self-requested a review August 5, 2025 09:26
@guangyey guangyey dismissed their stale review August 5, 2025 09:27

lint and UT fail

@moksiuc (Contributor, Author) commented Aug 5, 2025

Adjusted unit tests

@guangyey (Collaborator) left a comment

Thanks for the update! The UT seems to pass.

@guangyey guangyey added the ciflow/trunk label Aug 7, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/trunk label Aug 11, 2025
@@ -694,21 +696,20 @@ def determine_local_world_size(nproc_per_node: str):
raise ValueError("Cuda is not available.") from e
device_type = "gpu"
num_proc = torch.cuda.device_count()
elif nproc_per_node == "xpu":
Member

instead of adding XPU here -- thoughts on making gpu above use torch.accelerator so it automatically works for both cuda/xpu without a new config knob?

Collaborator

This semantic is already defined by auto at line 702.

if not torch.xpu.is_available():
raise ValueError("Xpu is not available.") from e
device_type = "xpu"
num_proc = torch.xpu.device_count()
elif nproc_per_node == torch._C._get_privateuse1_backend_name():
Member

Does XPU not trigger this code path? _get_privateuse1_backend_name should return xpu, right?

Collaborator

XPU is an in-tree backend, so _get_privateuse1_backend_name will never return xpu.

Labels
oncall: distributed · open source · topic: not user facing · triaged
7 participants