
[Don't merge]port 2 distributed pipeline test files for Intel GPU #159140


Open
wants to merge 2 commits into main

Conversation

wincent8
Contributor

@wincent8 wincent8 commented Jul 25, 2025

This is another PR to port distributed pipeline tests for Intel GPU; the companion PR is #159033.
In this PR, we port two test files for Intel GPU.
We enable Intel GPU with the following methods, trying our best to keep the original code style (a minimal sketch of both is shown after the list):

  1. instantiate_device_type_tests()
  2. adjust atol/rtol in torch.testing.assert_close to fix the accuracy gap introduced by oneDNN non-determinism
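
A minimal sketch of how these two methods can work together in a device-generic test; the test class, shapes, and tolerance values below are hypothetical illustrations, not the code in this PR:

```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests


class ExamplePipelineTest(TestCase):
    def test_matmul_close(self, device):
        a = torch.randn(64, 64, device=device)
        b = torch.randn(64, 64, device=device)
        out = a @ b
        ref = (a.cpu() @ b.cpu()).to(device)

        # Method 2: widen tolerances only on XPU to absorb the accuracy gap
        # introduced by non-deterministic oneDNN kernels (values are hypothetical).
        rtol, atol = None, None
        if self.device_type == "xpu":
            rtol, atol = 2e-5, 5e-5
        torch.testing.assert_close(out, ref, rtol=rtol, atol=atol)


# Method 1: instantiate per-device test classes (CPU/CUDA/XPU) from the template,
# keeping the original test body device-agnostic.
instantiate_device_type_tests(ExamplePipelineTest, globals(), only_for=["cpu", "cuda", "xpu"])

if __name__ == "__main__":
    run_tests()
```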

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @gujinghui @EikanWang @fengyuan14 @guangyey

pytorch-bot bot commented Jul 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159140

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 2 Unrelated Failures

As of commit 0b94210 with merge base aaa384b:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue topic: not user facing topic category labels Jul 25, 2025
@wincent8
Contributor Author

@pytorchbot label "module: xpu"

@pytorch-bot pytorch-bot bot added the module: xpu Intel XPU related issues label Jul 25, 2025
@wincent8
Contributor Author

@pytorchbot label "triaged"

@pytorch-bot pytorch-bot bot added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 25, 2025
@guangyey guangyey moved this to Pre-Review Required in PyTorch Intel Jul 26, 2025
@guangyey guangyey requested a review from EikanWang July 26, 2025 08:57
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 26, 2025
Collaborator

@guangyey guangyey left a comment

LGTM. Let @EikanWang make the final stamp.

@guangyey guangyey changed the title [WIP]port 2 distributed pipeline test files for Intel GPU port 2 distributed pipeline test files for Intel GPU Jul 29, 2025
Member

@d4l3k d4l3k left a comment

LGTM

@d4l3k d4l3k requested a review from H-Huang August 11, 2025 16:40
@@ -124,11 +131,18 @@ def test_stage_backward_weight(self, device):
ref_loss = loss_fn(ref_out, ref_target)
ref_loss.backward()

rtol, atol = None, None
if self.device_type == "xpu":
Member

having device-specific logic in the tests is not ideal. Kind of curious, is this only hit for this test, or should this logic be in the torch.testing.assert_close() util method?

Also, if the accuracy gap only happens for non-deterministic tests, can we just make the test deterministic?

Collaborator

@guangyey guangyey Aug 12, 2025

@wincent8 Let's try to set the deterministic option instead of tolerance.
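
For reference, one possible shape of this suggestion (a sketch assuming the relevant ops have deterministic implementations; the helper below is hypothetical, not code from this PR):

```python
import contextlib

import torch


@contextlib.contextmanager
def deterministic_mode():
    """Temporarily force deterministic algorithms instead of widening rtol/atol."""
    prev = torch.are_deterministic_algorithms_enabled()
    torch.use_deterministic_algorithms(True, warn_only=True)
    try:
        yield
    finally:
        torch.use_deterministic_algorithms(prev)


# Hypothetical usage inside the test body:
#     with deterministic_mode():
#         out = model(inp)
#         loss_fn(out, target).backward()
#     torch.testing.assert_close(out, ref_out)  # default tolerances
```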

Collaborator

> having device-specific logic in the tests is not ideal. Kind of curious, is this only hit for this test, or should this logic be in the torch.testing.assert_close() util method?
>
> Also, if the accuracy gap only happens for non-deterministic tests, can we just make the test deterministic?

TBH, only these cases exhibit accuracy gaps in non-deterministic tests. For all other cases, XPU behaves as expected.

Collaborator

@guangyey guangyey left a comment

Following @H-Huang's comments, avoid adding device-specific logic here.
@wincent8 Let's put this PR on hold until we determine a reasonable solution.

@wincent8
Contributor Author

> Following @H-Huang's comments, avoid adding device-specific logic here. @wincent8 Let's put this PR on hold until we determine a reasonable solution.

sure

@wincent8 wincent8 changed the title port 2 distributed pipeline test files for Intel GPU [Don't merge]port 2 distributed pipeline test files for Intel GPU Aug 12, 2025
Labels
ciflow/xpu - Run XPU CI tasks
module: xpu - Intel XPU related issues
oncall: distributed - Add this issue/PR to distributed oncall triage queue
open source
topic: not user facing - topic category
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Projects
Status: Pre-Review Required

5 participants