[PP] Refactor test_schedule_multiproc #158780
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158780
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Unrelated Failure
As of commit 8993582 with merge base 8389244:
NEW FAILURE - The following job has failed:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
# First, try the standard strict comparison
try:
    torch.testing.assert_close(grad1, grad2, rtol=rtol, atol=atol)
```
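The fallback branch is elided in this excerpt. The pattern under discussion — a strict comparison first, then a looser relative check — might look roughly like the sketch below; this is an illustration, not the exact code from the diff, and the fallback criterion and helper name are assumptions.

```python
import torch

def check_grads(grad1, grad2, rtol, atol, max_mismatch_frac=0.01):
    # Sketch of the two-tier check being discussed (hypothetical helper).
    try:
        # First, try the standard strict comparison.
        torch.testing.assert_close(grad1, grad2, rtol=rtol, atol=atol)
    except AssertionError:
        # Fall back to a looser check: tolerate a small fraction of
        # mismatched elements (the threshold here is illustrative).
        mismatched = ~torch.isclose(grad1, grad2, rtol=rtol, atol=atol)
        assert mismatched.float().mean() < max_mismatch_frac, "too many mismatched elements"
```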
What's the point of even doing this assertion? We'll never know whether we hit this case or fell through to the flexible case, so to me this amounts to making every test use the relative path. Is that what you wanted, or do you want some tests to be strict and some to be flexible?
Do we know why some tests were failing with the strict check?
Yeah, this was kind of a hack. I'm not completely sure why the tests are failing with the strict check, but it does seem to be because of precision loss due to communication. I only sporadically get errors that look something like this:
AssertionError: Tensor-likes are not close!
Mismatched elements: 66 / 262144 (0.0%)
Greatest absolute difference: 0.013273600488901138 at index (158, 163) (up to 0.005 allowed)
Greatest relative difference: 37.72719192504883 at index (158, 176) (up to 0.01 allowed)
To execute this test, run the following from the base repo dir:
python test/distributed/pipelining/test_schedule_multiproc.py ScheduleTest.test_grad_with_manual_interleaved_ScheduleClass0_use_new_runtime_True
Only a tiny fraction of the elements (66 / 262144, about 0.025%) are mismatched. I noticed that when I decrease the batch size (256 -> 64) I don't hit this error anymore, so I opted for that change for now.
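For reference, torch.testing.assert_close flags an element as mismatched when |actual - expected| > atol + rtol * |expected|. A quick check with values consistent with the failure above (the element magnitude here is a made-up example):

```python
import torch

# Tolerances from the failure message ("up to 0.005 / 0.01 allowed").
atol, rtol = 0.005, 0.01
expected = torch.tensor(0.8)   # hypothetical element value
actual = expected + 0.0133     # roughly the reported greatest absolute difference
# Fails because 0.0133 > 0.005 + 0.01 * 0.8 = 0.013
print(torch.isclose(actual, expected, rtol=rtol, atol=atol))  # tensor(False)
```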
I stamped already, but I want to understand what happened here. I'm glad you can just use the strict test. But did you figure out what was wrong, or is this going to be flaky now?
I think I understand it better now. Since the loss_fn is MSELoss(reduction="sum"), increasing the number of samples in the batch leads to larger loss values. The losses were quite large; for example, in test_grad_with_manual with 2 microbatches I am seeing:

losses = [tensor(65603.1562, device='cuda:1', grad_fn=<MseLossBackward0>), tensor(65587.8594, device='cuda:1', grad_fn=<MseLossBackward0>)]
ref_loss = tensor(131191., device='cuda:1', grad_fn=<MseLossBackward0>)

We can already see precision loss here, because sum(losses) = 131191.0156 while ref_loss is just 131191. On top of this, floating point arithmetic isn't associative, so grad_from_(loss1) + grad_from_(loss2) ≠ grad_from_(loss1 + loss2). This explains why reducing the batch size from 256 to 64 helps. I also tried MSELoss(reduction="mean") and it passes as well.
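A minimal standalone sketch of that effect (illustrative, not code from the PR or its tests): with reduction="sum", running backward per microbatch accumulates the same gradient math in a different order than a single full-batch backward, so the float32 gradients drift apart slightly, and the drift grows with the magnitude of the summed loss.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(512, 512)
x = torch.randn(256, 512)
target = torch.randn(256, 512)
loss_fn = torch.nn.MSELoss(reduction="sum")

# Full batch: one loss, one backward.
model.zero_grad()
loss_fn(model(x), target).backward()
ref_grad = model.weight.grad.clone()

# Two microbatches: two backwards, gradients accumulate across them.
model.zero_grad()
for xs, ts in zip(x.chunk(2), target.chunk(2)):
    loss_fn(model(xs), ts).backward()
mb_grad = model.weight.grad.clone()

# Mathematically identical, but the summation order differs, so the
# float32 results only agree up to a small error.
print((ref_grad - mb_grad).abs().max())
```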
Could you split this into 2 PRs: (1) the refactor part, which I'll stamp, and (2) the change in behavior of the tests, which I'm wondering about?
Updated the PR to only use the strict grad check, so there's no need to split it anymore.
thanks for cleaning this up!
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 2 checks: pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, lf.linux.12xlarge, unstable), trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
This refactors the pipelining schedule tests, since a lot of them repeat the same code:
1. Create a pipelined model and a reference model
2. Run the reference model and the pipelined model
3. Compare gradients

This refactors those repeated parts into helper methods and reduces the code by ~300 LOC. It also adds a better gradient check to resolve flakiness (fixes #154408).
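As a rough illustration of the helper shape described above (hypothetical names and signatures, not necessarily the ones added in the PR):

```python
import torch

def run_reference_model(model, inputs, targets, loss_fn):
    """Run the non-pipelined reference model once and collect its gradients."""
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    grads = {name: p.grad.clone() for name, p in model.named_parameters()}
    return loss, grads

def check_gradients(stage_module, ref_grads, rtol=1e-5, atol=4e-5):
    """Compare a pipeline stage's gradients against the reference model's."""
    for name, p in stage_module.named_parameters():
        torch.testing.assert_close(p.grad, ref_grads[name], rtol=rtol, atol=atol)
```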
cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta