
Support XPU in --nproc-per-node option to torchrun #159474


Open · wants to merge 19 commits into base: main

Conversation

@moksiuc (Contributor) commented Jul 30, 2025

Support both --nproc-per-node=xpu and autodetection of XPU device in case of --nproc-per-node=auto

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta


pytorch-bot bot commented Jul 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159474

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 36a5a47 with merge base 1c2cba1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed label Jul 30, 2025
@moksiuc (Contributor, Author) commented Jul 30, 2025

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing label Jul 30, 2025
@albanD albanD requested a review from d4l3k July 30, 2025 13:33
@albanD albanD added the triaged label Jul 30, 2025
@guangyey guangyey added the ciflow/xpu label Jul 31, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu label Jul 31, 2025
@guangyey guangyey added the ciflow/xpu and ciflow/trunk labels Jul 31, 2025
pytorch-bot bot commented Jul 31, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk label Jul 31, 2025
@guangyey guangyey added the ciflow/trunk label Jul 31, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/trunk and ciflow/xpu labels Jul 31, 2025
@moksiuc (Contributor, Author) commented Jul 31, 2025

I had to make an additional modification due to a lint error:

Lint for torch/distributed/run.py:

Error (MYPY) [union-attr]
Item "None" of "device | None" has no attribute "type"

    709  |        elif nproc_per_node == "auto":
    710  |            if torch.accelerator.is_available():
    711  |                num_proc = torch.accelerator.device_count()
>>> 712  |                device_type = torch.accelerator.current_accelerator().type
    713  |            else:
    714  |                num_proc = os.cpu_count()
    715  |                device_type = "cpu"
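The usual fix for this [union-attr] error is to guard the Optional return before touching .type. A minimal stdlib-only sketch of that pattern — the Device class and current_accelerator stub below are hypothetical stand-ins for torch's API, which is not imported here:

```python
from typing import Optional


class Device:
    """Minimal stand-in for torch.device (hypothetical, for illustration)."""

    def __init__(self, type: str) -> None:
        self.type = type


def current_accelerator() -> Optional[Device]:
    # Stand-in for torch.accelerator.current_accelerator(), which is
    # typed as returning an Optional device and may return None.
    return Device("xpu")


# Accessing .type directly on the Optional return is what trips mypy's
# [union-attr] check; guarding with a CPU fallback satisfies the checker:
acc = current_accelerator()
device_type = acc.type if acc is not None else "cpu"
```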

@guangyey previously approved these changes Aug 1, 2025
@guangyey guangyey added the ciflow/xpu label Aug 1, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu label Aug 1, 2025
@guangyey guangyey added the ciflow/xpu label Aug 2, 2025
@EikanWang (Collaborator)

@moksiuc, could you help check the failure related to distributed/launcher/test_run.py::ElasticLaunchTest::test_nproc_gpu_launch_configurations? The other two failures appear unrelated to this PR.

@moksiuc (Contributor, Author) commented Aug 4, 2025

@EikanWang
From the source of the test

self.assertSetEqual(
    {str(i) for i in range(world_size)}, set(os.listdir(self.test_dir))
)

and error message

AssertionError: Items in the second set but not the first:
'3'
'7'
'6'
'5'
'4'

From the logs, it looks like the test ran 8 workers while it expected only 3.
I don't know the test machine's configuration.
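For context, a stdlib-only sketch of what that assertion compares, assuming (per the snippet above) that each worker writes a file named after its rank into the test directory:

```python
import os
import tempfile

world_size = 3  # what the test expects
test_dir = tempfile.mkdtemp()

# Simulate 8 workers each writing a rank-named file, as appears to have
# happened on the CI machine:
for rank in range(8):
    open(os.path.join(test_dir, str(rank)), "w").close()

expected = {str(i) for i in range(world_size)}
actual = set(os.listdir(test_dir))
# The ranks present on disk but not expected by the test — these are
# exactly the items assertSetEqual reports as "in the second set but
# not the first":
extra = sorted(actual - expected)
```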

@pytorch-bot pytorch-bot bot removed the ciflow/xpu label Aug 4, 2025
@@ -413,7 +415,7 @@ def get_args_parser() -> ArgumentParser:
action=env,
type=str,
default="1",
help="Number of workers per node; supported values: [auto, cpu, gpu, int].",
help="Number of workers per node; supported values: [auto, cpu, gpu, xpu, int].",
Collaborator

The gpu term here means CUDA. It would be better to call it out to avoid confusing users.

Contributor Author

What do you mean? Change it to cuda or remove it completely?

@moksiuc (Contributor, Author) commented Aug 5, 2025

@guangyey
Debugging the CI failures shows the following results:
nproc_per_node=auto: device_type = 'cpu', num_proc = 8, os.cpu_count() = 8
torch.cuda.device_count() = 3
So, torch.accelerator does not include cuda. I'll have to leave the cuda branch in 'auto' mode.

Debug code used:
a29dba1
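A hedged sketch of the resulting branch order in 'auto' mode as described in this thread — a SimpleNamespace stub stands in for torch (the real module is not imported), with values mirroring the CI debug output above, where only torch.cuda reports devices:

```python
import os
import types

# Stub of the torch surface used below; on the CI machine in question,
# torch.accelerator reported nothing while torch.cuda reported 3 devices.
torch = types.SimpleNamespace(
    accelerator=types.SimpleNamespace(
        is_available=lambda: False,
        device_count=lambda: 0,
        current_accelerator=lambda: None,
    ),
    cuda=types.SimpleNamespace(
        is_available=lambda: True,
        device_count=lambda: 3,
    ),
)

# 'auto' branch order: prefer torch.accelerator, keep a separate cuda
# check as a fallback, and finally fall back to CPU:
if torch.accelerator.is_available():
    num_proc = torch.accelerator.device_count()
    device_type = torch.accelerator.current_accelerator().type
elif torch.cuda.is_available():
    num_proc = torch.cuda.device_count()
    device_type = "gpu"
else:
    num_proc = os.cpu_count()
    device_type = "cpu"
```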

@guangyey (Collaborator) left a comment

I can't believe that torch.accelerator doesn't include cuda.
I think the root cause is the UT itself.

@skip_but_pass_in_sandcastle_if(
    TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan"
)
@patch("torch.cuda.is_available", return_value=True)
@patch("torch.cuda.device_count", return_value=3)
def test_nproc_gpu_launch_configurations(self, _mock1, _mock2):
    self._test_nproc_launch_configuration("auto", 3)
    self._test_nproc_launch_configuration("gpu", 3)

This UT is patched so that torch.cuda.device_count always returns 3, even though it runs on a CPU-only machine.

@guangyey (Collaborator) commented Aug 5, 2025

I think we need to update that UT with some extra patches, such as
@patch("torch.accelerator.is_available", return_value=True)
@patch("torch.accelerator.device_count", return_value=3)
@patch("torch.accelerator.current_accelerator", return_value=torch.device("cuda"))
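To illustrate how such patches behave, a stdlib-only sketch using unittest.mock's patch.object on a stand-in accel namespace — the real patch targets would be the torch.accelerator functions named above, which are assumed, not imported:

```python
import types
from unittest.mock import patch

# Stand-in for torch.accelerator on a CPU-only machine:
accel = types.SimpleNamespace(
    is_available=lambda: False,
    device_count=lambda: 0,
)

# Inside the patch context, the machine "has" 3 accelerator devices;
# the patched attributes are MagicMocks with fixed return values:
with patch.object(accel, "is_available", return_value=True), \
     patch.object(accel, "device_count", return_value=3):
    patched = (accel.is_available(), accel.device_count())

# Outside the context, the original behavior is restored automatically:
restored = (accel.is_available(), accel.device_count())
```

The decorator form used in the UT works the same way, scoped to each test method.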

@guangyey guangyey self-requested a review August 5, 2025 09:26
@guangyey guangyey dismissed their stale review August 5, 2025 09:27

lint and UT fail

@moksiuc (Contributor, Author) commented Aug 5, 2025

Adjusted unit tests

@guangyey (Collaborator) left a comment

Thanks for the update! The UT seems to pass.

@guangyey guangyey added the ciflow/trunk label Aug 7, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/trunk label Aug 11, 2025
@@ -694,21 +696,20 @@ def determine_local_world_size(nproc_per_node: str):
raise ValueError("Cuda is not available.") from e
device_type = "gpu"
num_proc = torch.cuda.device_count()
elif nproc_per_node == "xpu":
Member

instead of adding XPU here -- thoughts on making gpu above use torch.accelerator so it automatically works for both cuda/xpu without a new config knob?

Collaborator

This semantic is already defined by auto at line 702.

if not torch.xpu.is_available():
raise ValueError("Xpu is not available.") from e
device_type = "xpu"
num_proc = torch.xpu.device_count()
elif nproc_per_node == torch._C._get_privateuse1_backend_name():
Member

Does XPU not trigger this code path? _get_privateuse1_backend_name should return xpu, right?

Collaborator

XPU is an in-tree backend, so _get_privateuse1_backend_name will never return xpu.

Labels
oncall: distributed · open source · topic: not user facing · triaged
7 participants