-
Notifications
You must be signed in to change notification settings - Fork 24.9k
Support XPU in --nproc-per-node option to torchrun #159474
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Support both --nproc-per-node=xpu and autodetection of XPU device in case of --nproc-per-node=auto
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159474
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 36a5a47 with merge base 1c2cba1 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
@pytorchbot label "topic: not user facing" |
To add the ciflow label This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows. |
I had to make additional modification due to lint error:
Error (MYPY) [union-attr]
|
@moksiuc , could you help check the failure related to |
@EikanWang self.assertSetEqual( and error message AssertionError: Items in the second set but not the first: in the logs it looks that test was run on 8 nodes while the test expected it to be run on only 3. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't believe that torch.accelerator
doesn't include cuda.
I think the root cause is the UT itself.
@skip_but_pass_in_sandcastle_if(
TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan"
)
@patch("torch.cuda.is_available", return_value=True)
@patch("torch.cuda.device_count", return_value=3)
def test_nproc_gpu_launch_configurations(self, _mock1, _mock2):
self._test_nproc_launch_configuration("auto", 3)
self._test_nproc_launch_configuration("gpu", 3)
This UT is patched to torch.cuda.device_count
always return 3 even though it is run on a CPU-only machine.
This reverts commit 98012b6.
Adjusted unit tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updated! UT seems passed.
@@ -694,21 +696,20 @@ def determine_local_world_size(nproc_per_node: str): | |||
raise ValueError("Cuda is not available.") from e | |||
device_type = "gpu" | |||
num_proc = torch.cuda.device_count() | |||
elif nproc_per_node == "xpu": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of adding XPU here -- thoughts on making gpu
above use torch.accelerator
so it automatically works for both cuda/xpu without a new config knob?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This semantic is auto
already defined at line 702.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nproc_per_node is provided by user of the script, 'xpu' can be provided so it has to be handled
if not torch.xpu.is_available(): | ||
raise ValueError("Xpu is not available.") from e | ||
device_type = "xpu" | ||
num_proc = torch.xpu.device_count() | ||
elif nproc_per_node == torch._C._get_privateuse1_backend_name(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does XPU not trigger this code path? _get_privateuse1_backend_name
should return xpu
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
XPU is an in-tree backend, so _get_privateuse1_backend_name
will never return xpu
Support both --nproc-per-node=xpu and autodetection of XPU device in case of --nproc-per-node=auto
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta