[PP] Refactor test_schedule_multiproc #158780

Closed
wants to merge 4 commits

Conversation

@H-Huang H-Huang commented Jul 21, 2025

Stack from ghstack (oldest at bottom):

This refactors the pipelining schedule tests, since many of them repeat the same code:

  1. Create a pipelined model and a reference model
  2. Run the reference model and the pipelined model
  3. Compare gradients

This PR extracts those steps into helper methods, reducing ~300 LOC. It also adds a better gradient check to resolve flakiness (fixes #154408).
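
As a rough illustration, the consolidated helper could look something like the sketch below (the names, signatures, and schedule.step usage here are illustrative assumptions, not the exact code in this PR):

    import torch

    # Hypothetical helper consolidating the repeated test pattern above;
    # the real helper names and signatures in this PR may differ.
    def run_and_check_grads(schedule, stage_module, ref_module, x, target, loss_fn):
        # 1. Run the reference (non-pipelined) model on the full batch.
        ref_module.zero_grad()
        loss_fn(ref_module(x), target).backward()

        # 2. Run the pipelined schedule on the same inputs; per-microbatch
        #    losses are collected into `losses`.
        losses = []
        schedule.step(x, target=target, losses=losses)

        # 3. Compare gradients of the parameters owned by this stage against
        #    the corresponding reference parameters.
        for name, param in stage_module.named_parameters():
            ref_param = ref_module.get_parameter(name)
            torch.testing.assert_close(param.grad, ref_param.grad)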

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

pytorch-bot bot commented Jul 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158780

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 8993582 with merge base 8389244:

NEW FAILURE - The following job has failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and topic: not user facing labels Jul 21, 2025
H-Huang added a commit that referenced this pull request Jul 21, 2025
ghstack-source-id: 5538d26
Pull Request resolved: #158780
@H-Huang H-Huang changed the title Refactor test_schedule_multiproc [PP] Refactor test_schedule_multiproc Jul 21, 2025
@H-Huang H-Huang added the module: pipelining Pipeline Parallelism label Jul 21, 2025
@H-Huang H-Huang requested a review from kwen2501 July 21, 2025 21:25
@H-Huang H-Huang requested a review from wconstab July 24, 2025 18:35

# First, try the standard strict comparison
try:
    torch.testing.assert_close(grad1, grad2, rtol=rtol, atol=atol)
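
For context on the comment below: the strict-then-flexible comparison under discussion would look roughly like this (a hedged reconstruction, not the exact PR code; relaxed_rtol is a hypothetical name):

    # Hedged reconstruction of the strict-then-flexible gradient comparison
    # (not the exact PR code; relaxed_rtol is hypothetical).
    try:
        # Strict elementwise comparison with explicit tolerances.
        torch.testing.assert_close(grad1, grad2, rtol=rtol, atol=atol)
    except AssertionError:
        # Flexible fallback: only require that the relative error stays small,
        # guarding against division by values that are close to zero.
        rel_err = (grad1 - grad2).abs() / grad2.abs().clamp_min(1e-8)
        assert rel_err.max() < relaxed_rtol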
Contributor

What's the point of even doing this assertion? We'll never know if we hit this case or fell through to the flexible case, so to me this amounts to making every test use the relative path. Is that what you wanted, or do you want some tests to be strict and some to be flexible?

Do we know why some tests were failing with the strict check?

Member Author
@H-Huang H-Huang Jul 28, 2025

Yeah, this was kind of a hack. I'm not completely sure why the tests are failing with the strict check, but it does seem to be because of precision loss due to communication. I only sporadically get errors that look something like this:

AssertionError: Tensor-likes are not close!

Mismatched elements: 66 / 262144 (0.0%)
Greatest absolute difference: 0.013273600488901138 at index (158, 163) (up to 0.005 allowed)
Greatest relative difference: 37.72719192504883 at index (158, 176) (up to 0.01 allowed)

To execute this test, run the following from the base repo dir:
    python test/distributed/pipelining/test_schedule_multiproc.py ScheduleTest.test_grad_with_manual_interleaved_ScheduleClass0_use_new_runtime_True

Only a tiny fraction of the elements (0.0002, about 0.02%) are mismatched. I noticed that when I decrease the batch size (256 -> 64) I don't hit this error anymore, so I opted for that change for now.

Contributor

I stamped already, but I want to understand what happened here. I'm glad you can just use the strict test. But did you figure out what was wrong, or is this going to be flaky now?

Member Author
@H-Huang H-Huang Aug 1, 2025

I think I understand it better now. Since the loss_fn is MSELoss(reduction="sum"), increasing the number of samples in the batch leads to larger loss values.

The losses were quite large; for example, in test_grad_with_manual with 2 microbatches I am seeing:

losses = [tensor(65603.1562, device='cuda:1', grad_fn=<MseLossBackward0>), tensor(65587.8594, device='cuda:1', grad_fn=<MseLossBackward0>)] 
ref_loss = tensor(131191., device='cuda:1', grad_fn=<MseLossBackward0>)

We can already see precision loss here, because sum(losses) = 131191.0156 while ref_loss is just 131191. On top of this, floating point arithmetic isn't associative, so (grad_from_loss1) + (grad_from_loss2) ≠ grad_from_(loss1 + loss2).

This explains why reducing the batch size 256 -> 64 helps. I also tried MSELoss(reduction="mean") and it passes as well.
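
As a standalone illustration of this (not code from the PR; the model, sizes, and seed below are arbitrary), gradients accumulated from per-microbatch sum-reduced losses need not match the full-batch gradient bit-for-bit:

    import torch

    torch.manual_seed(0)
    model = torch.nn.Linear(64, 64)
    loss_fn = torch.nn.MSELoss(reduction="sum")
    x, y = torch.randn(256, 64), torch.randn(256, 64)

    # Full-batch gradient (analogous to the reference model).
    model.zero_grad()
    loss_fn(model(x), y).backward()
    ref_grad = model.weight.grad.clone()

    # Gradient accumulated over 2 microbatches (analogous to the pipelined run).
    model.zero_grad()
    for xb, yb in zip(x.chunk(2), y.chunk(2)):
        loss_fn(model(xb), yb).backward()

    # Mathematically identical, but float addition is not associative, so the
    # difference is typically nonzero and grows with the magnitude of the loss.
    print((ref_grad - model.weight.grad).abs().max())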

Contributor
@wconstab wconstab left a comment

Could you split this into 2 PRs: (1) the refactor part, which I'll stamp, and (2) the change in behavior of the tests, which I'm wondering about?

Member Author

H-Huang commented Jul 28, 2025

Updated the PR to only use the strict grad check, so there's no need to split it anymore.

@H-Huang H-Huang requested a review from wconstab July 28, 2025 19:47
@H-Huang H-Huang added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 29, 2025
H-Huang added 2 commits July 31, 2025 11:53
Contributor
@wconstab wconstab left a comment

Thanks for cleaning this up!

Member Author

H-Huang commented Aug 1, 2025

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, lf.linux.12xlarge, unstable), trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels: ciflow/trunk, Merged, module: pipelining, oncall: distributed, topic: not user facing