
[Don't merge]port 2 distributed pipeline test files for Intel GPU #159140


Open
wants to merge 2 commits into main

Conversation

wincent8
Contributor

@wincent8 wincent8 commented Jul 25, 2025

This is another PR to port distributed pipeline tests for Intel GPU; the companion PR is #159033.
In this PR, we port two test files for Intel GPU.
We enable Intel GPU with the following methods, trying our best to keep the original code style (a minimal sketch of both is shown after the list):

  1. instantiate_device_type_tests()
  2. adjust atol/rtol in torch.testing.assert_close to fix the accuracy gap introduced by oneDNN non-determinism
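
A minimal sketch of how these two methods can work together in a device-generic test; the test class, shapes, and tolerance values below are hypothetical illustrations, not the code in this PR:

```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests


class ExamplePipelineTest(TestCase):
    def test_matmul_close(self, device):
        a = torch.randn(64, 64, device=device)
        b = torch.randn(64, 64, device=device)
        out = a @ b
        ref = (a.cpu() @ b.cpu()).to(device)

        # Method 2: widen tolerances only on XPU to absorb the accuracy gap
        # introduced by non-deterministic oneDNN kernels (values are hypothetical).
        rtol, atol = None, None
        if self.device_type == "xpu":
            rtol, atol = 2e-5, 5e-5
        torch.testing.assert_close(out, ref, rtol=rtol, atol=atol)


# Method 1: instantiate per-device test classes (CPU/CUDA/XPU) from the template,
# keeping the original test body device-agnostic.
instantiate_device_type_tests(ExamplePipelineTest, globals(), only_for=["cpu", "cuda", "xpu"])

if __name__ == "__main__":
    run_tests()
```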

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @gujinghui @EikanWang @fengyuan14 @guangyey

pytorch-bot bot commented Jul 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159140

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 2 Unrelated Failures

As of commit 0b94210 with merge base aaa384b:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue topic: not user facing topic category labels Jul 25, 2025
@wincent8
Contributor Author

@pytorchbot label "module: xpu"

@pytorch-bot pytorch-bot bot added the module: xpu Intel XPU related issues label Jul 25, 2025
@wincent8
Contributor Author

@pytorchbot label "triaged"

@pytorch-bot pytorch-bot bot added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 25, 2025
@guangyey guangyey moved this to Pre-Review Required in PyTorch Intel Jul 26, 2025
@guangyey guangyey requested a review from EikanWang July 26, 2025 08:57
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 26, 2025
Collaborator

@guangyey guangyey left a comment

LGTM. Let @EikanWang make the final stamp.

@guangyey guangyey changed the title [WIP]port 2 distributed pipeline test files for Intel GPU port 2 distributed pipeline test files for Intel GPU Jul 29, 2025
Member

@d4l3k d4l3k left a comment

LGTM

@d4l3k d4l3k requested a review from H-Huang August 11, 2025 16:40
@@ -124,11 +131,18 @@ def test_stage_backward_weight(self, device):
ref_loss = loss_fn(ref_out, ref_target)
ref_loss.backward()

rtol, atol = None, None
if self.device_type == "xpu":
Member

having device-specific logic in the tests is not ideal. Kind of curious, is this only hit for this test, or should this logic be in the torch.testing.assert_close() util method?

Also, if the accuracy gap only happens for non-deterministic tests, can we just make the test deterministic?

Collaborator

@guangyey guangyey Aug 12, 2025

@wincent8 Let's try to set the deterministic option instead of tolerance.
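
For reference, one possible shape of this suggestion (a sketch assuming the relevant ops have deterministic implementations; the helper below is hypothetical, not code from this PR):

```python
import contextlib

import torch


@contextlib.contextmanager
def deterministic_mode():
    """Temporarily force deterministic algorithms instead of widening rtol/atol."""
    prev = torch.are_deterministic_algorithms_enabled()
    torch.use_deterministic_algorithms(True, warn_only=True)
    try:
        yield
    finally:
        torch.use_deterministic_algorithms(prev)


# Hypothetical usage inside the test body:
#     with deterministic_mode():
#         out = model(inp)
#         loss_fn(out, target).backward()
#     torch.testing.assert_close(out, ref_out)  # default tolerances
```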

Collaborator

> having device-specific logic in the tests is not ideal. Kind of curious, is this only hit for this test, or should this logic be in the torch.testing.assert_close() util method?
>
> Also, if the accuracy gap only happens for non-deterministic tests, can we just make the test deterministic?

TBH, only these cases exhibit accuracy gaps in non-deterministic tests. For all other cases, XPU behaves as expected.

Collaborator

@guangyey guangyey left a comment

Following @H-Huang's comments, avoid adding device-specific logic here.
@wincent8 Let's put this PR on hold until we determine a reasonable solution.

@wincent8
Contributor Author

> Following @H-Huang's comments, avoid adding device-specific logic here. @wincent8 Let's put this PR on hold until we determine a reasonable solution.

sure

@wincent8 wincent8 changed the title port 2 distributed pipeline test files for Intel GPU [Don't merge]port 2 distributed pipeline test files for Intel GPU Aug 12, 2025
Labels
ciflow/xpu - Run XPU CI tasks
module: xpu - Intel XPU related issues
oncall: distributed - Add this issue/PR to distributed oncall triage queue
open source
topic: not user facing - topic category
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Projects
Status: Pre-Review Required

5 participants