[PP] Refactor test_schedule_multiproc #158780

Closed
wants to merge 4 commits

Conversation

@H-Huang H-Huang commented Jul 21, 2025

Stack from ghstack (oldest at bottom):

This refactors the pipelining schedule tests, since many of them repeat the same code:

  1. Create a pipelined model and a reference model
  2. Run the reference model and the pipelined model
  3. Compare gradients

This PR extracts those steps into helper methods, reducing ~300 LOC. It also adds a better gradient check to resolve flakiness (fixes #154408).
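
As a rough illustration, the consolidated helper could look something like the sketch below (the names, signatures, and schedule.step usage here are illustrative assumptions, not the exact code in this PR):

    import torch

    # Hypothetical helper consolidating the repeated test pattern above;
    # the real helper names and signatures in this PR may differ.
    def run_and_check_grads(schedule, stage_module, ref_module, x, target, loss_fn):
        # 1. Run the reference (non-pipelined) model on the full batch.
        ref_module.zero_grad()
        loss_fn(ref_module(x), target).backward()

        # 2. Run the pipelined schedule on the same inputs; per-microbatch
        #    losses are collected into `losses`.
        losses = []
        schedule.step(x, target=target, losses=losses)

        # 3. Compare gradients of the parameters owned by this stage against
        #    the corresponding reference parameters.
        for name, param in stage_module.named_parameters():
            ref_param = ref_module.get_parameter(name)
            torch.testing.assert_close(param.grad, ref_param.grad)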

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

pytorch-bot bot commented Jul 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158780

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 8993582 with merge base 8389244:

NEW FAILURE - The following job has failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and topic: not user facing labels Jul 21, 2025
H-Huang added a commit that referenced this pull request Jul 21, 2025
ghstack-source-id: 5538d26
Pull Request resolved: #158780
@H-Huang H-Huang changed the title Refactor test_schedule_multiproc [PP] Refactor test_schedule_multiproc Jul 21, 2025
@H-Huang H-Huang added the module: pipelining Pipeline Parallelism label Jul 21, 2025
@H-Huang H-Huang requested a review from kwen2501 July 21, 2025 21:25
@H-Huang H-Huang requested a review from wconstab July 24, 2025 18:35

# First, try the standard strict comparison
try:
    torch.testing.assert_close(grad1, grad2, rtol=rtol, atol=atol)
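
For context on the comment below: the strict-then-flexible comparison under discussion would look roughly like this (a hedged reconstruction, not the exact PR code; relaxed_rtol is a hypothetical name):

    # Hedged reconstruction of the strict-then-flexible gradient comparison
    # (not the exact PR code; relaxed_rtol is hypothetical).
    try:
        # Strict elementwise comparison with explicit tolerances.
        torch.testing.assert_close(grad1, grad2, rtol=rtol, atol=atol)
    except AssertionError:
        # Flexible fallback: only require that the relative error stays small,
        # guarding against division by values that are close to zero.
        rel_err = (grad1 - grad2).abs() / grad2.abs().clamp_min(1e-8)
        assert rel_err.max() < relaxed_rtol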
Contributor

What's the point of even doing this assertion? We'll never know if we hit this case or fell through to the flexible case, so to me this amounts to making every test use the relative path. Is that what you wanted, or do you want some tests to be strict and some to be flexible?

Do we know why some tests were failing with the strict check?

Member Author
@H-Huang H-Huang Jul 28, 2025

Yeah, this was kind of a hack. I'm not completely sure why the tests are failing with the strict check, but it does seem to be because of precision loss due to communication. I only sporadically get errors that look something like this:

AssertionError: Tensor-likes are not close!

Mismatched elements: 66 / 262144 (0.0%)
Greatest absolute difference: 0.013273600488901138 at index (158, 163) (up to 0.005 allowed)
Greatest relative difference: 37.72719192504883 at index (158, 176) (up to 0.01 allowed)

To execute this test, run the following from the base repo dir:
    python test/distributed/pipelining/test_schedule_multiproc.py ScheduleTest.test_grad_with_manual_interleaved_ScheduleClass0_use_new_runtime_True

Only a tiny fraction of the elements (0.0002, about 0.02%) are mismatched. I noticed that when I decrease the batch size (256 -> 64) I don't hit this error anymore, so I opted for that change for now.

Contributor

I stamped already, but I want to understand what happened here. I'm glad you can just use the strict test. But did you figure out what was wrong, or is this going to be flaky now?

Member Author
@H-Huang H-Huang Aug 1, 2025

I think I understand it better now. Since the loss_fn is MSELoss(reduction="sum"), increasing the number of samples in the batch leads to larger loss values.

The losses were quite large; for example, in test_grad_with_manual with 2 microbatches I am seeing:

losses = [tensor(65603.1562, device='cuda:1', grad_fn=<MseLossBackward0>), tensor(65587.8594, device='cuda:1', grad_fn=<MseLossBackward0>)] 
ref_loss = tensor(131191., device='cuda:1', grad_fn=<MseLossBackward0>)

We can already see precision loss here, because sum(losses) = 131191.0156 while ref_loss is just 131191. On top of this, floating point arithmetic isn't associative, so (grad_from_loss1) + (grad_from_loss2) ≠ grad_from_(loss1 + loss2).

This explains why reducing the batch size 256 -> 64 helps. I also tried MSELoss(reduction="mean") and it passes as well.
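
As a standalone illustration of this (not code from the PR; the model, sizes, and seed below are arbitrary), gradients accumulated from per-microbatch sum-reduced losses need not match the full-batch gradient bit-for-bit:

    import torch

    torch.manual_seed(0)
    model = torch.nn.Linear(64, 64)
    loss_fn = torch.nn.MSELoss(reduction="sum")
    x, y = torch.randn(256, 64), torch.randn(256, 64)

    # Full-batch gradient (analogous to the reference model).
    model.zero_grad()
    loss_fn(model(x), y).backward()
    ref_grad = model.weight.grad.clone()

    # Gradient accumulated over 2 microbatches (analogous to the pipelined run).
    model.zero_grad()
    for xb, yb in zip(x.chunk(2), y.chunk(2)):
        loss_fn(model(xb), yb).backward()

    # Mathematically identical, but float addition is not associative, so the
    # difference is typically nonzero and grows with the magnitude of the loss.
    print((ref_grad - model.weight.grad).abs().max())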

Contributor
@wconstab wconstab left a comment

Could you split this into 2 PRs: (1) the refactor part, which I'll stamp, and (2) the change in behavior of the tests, which I'm wondering about?

Member Author

H-Huang commented Jul 28, 2025

Updated the PR to only use the strict grad check, so there's no need to split it anymore.

@H-Huang H-Huang requested a review from wconstab July 28, 2025 19:47
@H-Huang H-Huang added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 29, 2025
H-Huang added 2 commits July 31, 2025 11:53
Contributor
@wconstab wconstab left a comment

Thanks for cleaning this up!

Member Author

H-Huang commented Aug 1, 2025

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, lf.linux.12xlarge, unstable), trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels: ciflow/trunk, Merged, module: pipelining, oncall: distributed, topic: not user facing