
[CI][CUDA] Add periodic b200 distributed job #159323


Open · wants to merge 16 commits into main from main-enable-b200-distributed-tests
Conversation

nWEIdia
Collaborator

@nWEIdia nWEIdia commented Jul 29, 2025

  1. Run the distributed job with the B200 runner, periodically.
  2. Discovered a generic distributed test issue: certain unit tests hard-code ranks, which calls for a require_exact_world_size(world_size) API instead of require_world_size(world_size) (see the sketch after this list).
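
A minimal sketch of what such a decorator could look like, assuming the existing require_world_size(n) only checks that at least n ranks are available; require_exact_world_size does not exist in PyTorch yet, and the WORLD_SIZE lookup and skip behavior below are illustrative assumptions rather than the actual API.

```python
import os
import unittest
from functools import wraps


def require_exact_world_size(n):
    """Skip a distributed test unless it runs with exactly n ranks."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            # Assumption: the launcher exports WORLD_SIZE, as in the pytest
            # invocation used later in this PR (WORLD_SIZE=4 ...).
            world_size = int(os.environ.get("WORLD_SIZE", "1"))
            if world_size != n:
                raise unittest.SkipTest(
                    f"test hard-codes ranks 0..{n - 1} and needs exactly {n} "
                    f"ranks, got WORLD_SIZE={world_size}"
                )
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

With such a decorator, a test that hard-codes ranks 0-3 would be skipped rather than fail on an 8-GPU B200 runner.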

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ptrblck @eqy @tinglvv @huydhn @atalman @ZainRizvi @malfet

@nWEIdia nWEIdia requested a review from a team as a code owner July 29, 2025 01:06
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jul 29, 2025

pytorch-bot bot commented Jul 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159323

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Cancelled Jobs

As of commit d3af5a4 with merge base e11b1cd:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@HDCharles HDCharles added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 29, 2025
@nWEIdia nWEIdia force-pushed the main-enable-b200-distributed-tests branch from dbb83c8 to 37d08be Compare August 2, 2025 15:53
@nWEIdia nWEIdia changed the title [Draft][CI][CUDA] Add b200 distributed job [CI][CUDA] Add b200 distributed job Aug 2, 2025
@nWEIdia nWEIdia mentioned this pull request Aug 4, 2025
@nWEIdia nWEIdia added the ciflow/periodic Trigger jobs run periodically on master (periodic.yml) on the PR label Aug 5, 2025
@nWEIdia nWEIdia changed the title [CI][CUDA] Add b200 distributed job [CI][CUDA] Add periodic b200 distributed job Aug 5, 2025
For example, an 8-GPU B200 runner makes the unit test run with world_size = 8.
Tested with:
TEMP_DIR=/tmp BACKEND=nccl WORLD_SIZE=4 pytest -v test/distributed/test_distributed_spawn.py -k test_new_subgroups_world_size_not_divisible_by_group_size

Test with B200 CI whether it fixes the failure in https://github.com/pytorch/pytorch/actions/runs/16757750565/job/47447831786
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Aug 6, 2025
@nWEIdia
Collaborator Author

nWEIdia commented Aug 6, 2025

Linking #158695, which addresses the following failure also seen on B200: https://github.com/pytorch/pytorch/actions/runs/16757750565/job/47447831792

AttributeError: '_MeshEnv' object has no attribute 'create_child_mesh'. Did you mean: 'create_sub_mesh'?
To execute this test, run the following from the base repo dir:
python test/distributed/checkpoint/e2e/test_fsdp_ep.py TestFSDPWithEP.test_e2e
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0.
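
The error appears to come from an API rename in torch.distributed.device_mesh (the removed create_child_mesh vs. the current create_sub_mesh). A quick probe like the one below, which assumes only that the module-level _mesh_resources object named in the error exists, shows which method a given build provides:

```python
# Compatibility probe: report which of the two method names this torch build
# exposes on the _MeshEnv instance mentioned in the AttributeError above.
from torch.distributed.device_mesh import _mesh_resources

for name in ("create_child_mesh", "create_sub_mesh"):
    status = "present" if hasattr(_mesh_resources, name) else "missing"
    print(f"{name}: {status}")
```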

@nWEIdia nWEIdia added the keep-going Don't stop on first failure, keep running tests until the end label Aug 6, 2025
@nWEIdia
Collaborator Author

nWEIdia commented Aug 7, 2025

The python test/distributed/test_distributed_spawn.py TestDistBackendWithSpawn.test_3_level_hierarchical_model_averager failure in https://github.com/pytorch/pytorch/actions/runs/16764647752/job/47468897559 happens because:

The test originally expected len(period_group_size_dict) groups, but that's incorrect: the actual count depends on how many subgroups are created for each group size.

  • For world_size=4: subgroup_size=2 creates 2 groups and subgroup_size=4 creates 0 new groups, so there are 2 subgroups and 3 process groups in total.
  • For world_size=8: subgroup_size=2 creates 4 groups and subgroup_size=4 creates 2 groups, so there are 6 subgroups and 7 process groups in total.

See the counting sketch below.
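
A hypothetical helper that mirrors the counting described above (the function and the example period keys are placeholders for illustration, not code from the test):

```python
# Mirrors the counting above: a subgroup size equal to world_size creates no
# new group, a smaller size creates world_size // size groups, and the default
# (world) process group adds one more.
def expected_pg_count(world_size, period_group_size_dict):
    new_groups = sum(
        world_size // size
        for size in period_group_size_dict.values()
        if size < world_size
    )
    return new_groups + 1  # plus the default process group


assert expected_pg_count(4, {2: 2, 4: 4}) == 3  # 2 subgroups + world group
assert expected_pg_count(8, {2: 2, 4: 4}) == 7  # 4 + 2 subgroups + world group
```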

@huydhn
Contributor

huydhn commented Aug 8, 2025

Looking at the numbers from https://github.com/pytorch/pytorch/actions/runs/16814299990/job/47629596189, I think we need to consider running this even less frequently, maybe nightly. My reasoning: we have 3 8xB200 runners here, so the total bank we have per day is 3 * 24 = 72 hours. Each distributed shard from that run seems to take close to 4 hours (I'm not sure why they take that long to finish), and there are 3 of them set to run every 4 hours, so 3 shards * 4 hours * 6 runs will eat up all the capacity. I'm proposing that we do the same as the H100 distributed job and start with some smoke tests first: https://github.com/pytorch/pytorch/blob/main/.github/workflows/h100-distributed.yml#L46.

If we take a rough approach of dividing the 72 hours into 4 groups for PyTorch core, compiler, distributed, and vLLM, distributed jobs shouldn't take more than 25%, or 18 hours, per day.
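
For reference, a back-of-the-envelope sketch of that budget (the numbers are taken from the comment above; nothing here reflects an actual workflow configuration):

```python
# Daily B200 capacity vs. the current periodic distributed schedule.
runners = 3
capacity_hours = runners * 24               # 72 runner-hours per day

shards = 3
hours_per_shard = 4                         # observed ~4 h per distributed shard
runs_per_day = 24 // 4                      # a run every 4 hours -> 6 runs
usage_hours = shards * hours_per_shard * runs_per_day   # 3 * 4 * 6 = 72

distributed_share = 0.25                    # rough split with core/compiler/vLLM
distributed_budget = capacity_hours * distributed_share  # 18 hours per day

print(f"used {usage_hours} h/day vs. budget {distributed_budget:.0f} h/day")
```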

@nWEIdia
Collaborator Author

nWEIdia commented Aug 8, 2025

I think we need to consider running this even less frequently, maybe nightly; each distributed shard takes close to 4 hours and the 3 shards run every 4 hours, which eats up all 72 hours of 8xB200 capacity per day. I'm proposing that we do the same as the H100 distributed job and run just some smoke tests: https://github.com/pytorch/pytorch/blob/main/.github/workflows/h100-distributed.yml#L46

Good observation!
I feel bad spending time here only to fix issues that could also be exposed by running with, e.g., A100 GPUs (or perhaps even CPU multi-node jobs). A subset of the distributed unit tests passes with a world_size of 4 but not with a world_size of 8.
So perhaps I should withdraw this PR and let the B200 nodes really work on benchmark jobs.

But benchmarks are using linux.dgx.b200, so which job would be using linux.dgx.b200.8?

@huydhn
Contributor

huydhn commented Aug 8, 2025

But benchmarks are using linux.dgx.b200, so which job would be using linux.dgx.b200.8?

  • Running distributed jobs on linux.dgx.b200.8 is a legit use case; we can loop in the team to see which tests they want to cover there first instead of running everything (cc @kwen2501)
  • The other user of linux.dgx.b200.8 is the vLLM benchmark, which needs all 8 GPUs to cover large models like DeepSeek V3

It also seems that there is a legit use case for 4xB200, so I will take one 8xB200 and turn it into two 4xB200. We can be flexible here.

@nWEIdia
Collaborator Author

nWEIdia commented Aug 8, 2025

Yes, I think using 4xB200 may make a whole bunch of failing unit tests pass automatically.
Running nightly with just linux.dgx.b200.4 seems better than running periodically with linux.dgx.b200.8.

nWEIdia added 2 commits August 7, 2025 18:36
quite long to finish all the tests (easily 4 hours+ for each of the 3 shards).
@nWEIdia nWEIdia removed the ciflow/periodic Trigger jobs run periodically on master (periodic.yml) on the PR label Aug 8, 2025
Labels
  • keep-going: Don't stop on first failure, keep running tests until the end
  • oncall: distributed: Add this issue/PR to distributed oncall triage queue
  • open source
  • topic: not user facing: topic category
  • triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module