[CI][CUDA] Add periodic b200 distributed job #159323
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159323
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures, 2 Cancelled Jobs as of commit d3af5a4 with merge base e11b1cd
NEW FAILURES - The following jobs have failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
linux.dgx.b200 naming for runner.
…3_and_ecr_read_only to test job on B200 runner
Force-pushed from dbb83c8 to 37d08be.
For example, an 8-GPU B200 runner would make the unit test run with world_size = 8. Tested locally with: `TEMP_DIR=/tmp BACKEND=nccl WORLD_SIZE=4 pytest -v test/distributed/test_distributed_spawn.py -k test_new_subgroups_world_size_not_divisible_by_group_size`. Testing with B200 CI to see whether it fixes the failure in https://github.com/pytorch/pytorch/actions/runs/16757750565/job/47447831786 (a sketch of the divisibility rule involved follows this comment).
Linking #158695 for addressing the following failure, also seen on B200: https://github.com/pytorch/pytorch/actions/runs/16757750565/job/47447831792
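For context, here is a minimal sketch of the divisibility rule this subgroup test exercises. It mirrors the `world_size % group_size` check that `torch.distributed.new_subgroups` performs; it is not the test code itself, and the specific sizes shown are illustrative assumptions.

```python
# Sketch of the subgroup-size divisibility rule referenced by the test name.
# Mirrors the check torch.distributed.new_subgroups() performs; not the test itself.
def is_valid_subgroup_size(world_size: int, group_size: int) -> bool:
    return 0 < group_size <= world_size and world_size % group_size == 0

assert is_valid_subgroup_size(4, 2)        # 2 divides 4 -> subgroups can be created
assert not is_valid_subgroup_size(4, 3)    # 3 does not divide 4 -> ValueError expected
assert not is_valid_subgroup_size(8, 3)    # still invalid when the runner exposes 8 GPUs
```

Which sizes trigger the error therefore depends on how many GPUs (and hence what world_size) the runner exposes.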
These tests require a world_size of 4 to pass.
The `python test/distributed/test_distributed_spawn.py TestDistBackendWithSpawn.test_3_level_hierarchical_model_averager` failure in https://github.com/pytorch/pytorch/actions/runs/16764647752/job/47468897559 happens because the test originally expected `len(period_group_size_dict)` groups, which is incorrect. The actual count depends on how many subgroups are created for each group size. For world_size=4: subgroup_size=2 creates 2 groups and subgroup_size=4 creates 0 new groups, so 2 new groups and 3 process groups in total. For world_size=8: subgroup_size=2 creates 4 groups and subgroup_size=4 creates 2 groups, so 6 new groups and 7 process groups in total.
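To make the counting above concrete, here is a small sketch (not the test code). It assumes the averager reuses the default world group whenever a subgroup size equals world_size and otherwise creates `world_size // subgroup_size` subgroups, which matches the numbers quoted above.

```python
# Sketch of the process-group counting argument above; assumes a world-sized
# level reuses the default group instead of creating a new one.
def expected_process_groups(world_size: int, subgroup_sizes: list[int]) -> int:
    new_groups = sum(
        world_size // size
        for size in subgroup_sizes
        if size != world_size  # world-sized level reuses the default group
    )
    return new_groups + 1  # +1 for the default (world) process group

assert expected_process_groups(4, [2, 4]) == 3  # 2 new subgroups + world group
assert expected_process_groups(8, [2, 4]) == 7  # 4 + 2 new subgroups + world group
```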
Looking at the numbers from https://github.com/pytorch/pytorch/actions/runs/16814299990/job/47629596189, I think we need to consider running this even less frequently, maybe nightly. My reasoning: we have 3 8xB200 runners here, so the total budget per day is 3 * 24 = 72 runner-hours. Each distributed shard in https://github.com/pytorch/pytorch/actions/runs/16814299990/job/47629596189 seems to take close to 4 hours (I'm not sure why they take that long to finish), and there are 3 of them scheduled every 4 hours, so 3 shards * 4 hours * 6 runs per day would eat up all the capacity (a quick calculation is sketched below). I'm proposing that we do the same as the H100 distributed job and start with some smoke tests first: https://github.com/pytorch/pytorch/blob/main/.github/workflows/h100-distributed.yml#L46. If we take a rough approach of dividing the 72 hours into 4 groups for PyTorch core, compiler, distributed, and vLLM, distributed jobs shouldn't take more than 25%, or 18 hours, per day.
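A back-of-the-envelope version of the capacity math above; the 4-hour shard time is an observation from the linked run, not a measured average.

```python
# Daily B200 capacity vs. the proposed periodic distributed schedule.
runners = 3                       # 8xB200 runners in the pool
budget = runners * 24             # 72 runner-hours per day

shards = 3                        # distributed shards per scheduled run
hours_per_shard = 4               # observed in the linked run
runs_per_day = 24 // 4            # scheduled every 4 hours -> 6 runs/day
usage = shards * hours_per_shard * runs_per_day  # 72 runner-hours/day

print(f"budget={budget}h/day, usage={usage}h/day, share={usage / budget:.0%}")
# budget=72h/day, usage=72h/day, share=100% -> this schedule alone saturates the
# pool, far beyond the proposed 25% (18 h/day) allocation for distributed jobs.
```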
Good observation! But the benchmarks are using linux.dgx.b200, so which job would be using linux.dgx.b200.8?
It seems that there is a legit use case for 4xB200, so I will take one 8xB200 and turn it into two 4xB200 runners. We can be flexible here.
Yes, I think using 4xB200 may make a whole bunch of failed unit tests pass automatically.
It takes quite a long time to finish all the tests (easily 4+ hours for each of the 3 shards).
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ptrblck @eqy @tinglvv @huydhn @atalman @ZainRizvi @malfet