
[CI][CUDA] Add periodic b200 distributed job #159323


Open · wants to merge 16 commits into main from main-enable-b200-distributed-tests
Conversation

nWEIdia
Collaborator

@nWEIdia nWEIdia commented Jul 29, 2025

  1. Run the distributed job with the B200 runner, periodically.
  2. Discovered a generic distributed test issue: certain unit tests hard-code ranks, which calls for a require_exact_world_size(world_size) API instead of require_world_size(world_size) (see the sketch after this list).
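
A minimal sketch of what such a decorator could look like, assuming the existing require_world_size(n) only checks that at least n ranks are available; require_exact_world_size does not exist in PyTorch yet, and the WORLD_SIZE lookup and skip behavior below are illustrative assumptions rather than the actual API.

```python
import os
import unittest
from functools import wraps


def require_exact_world_size(n):
    """Skip a distributed test unless it runs with exactly n ranks."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            # Assumption: the launcher exports WORLD_SIZE, as in the pytest
            # invocation used later in this PR (WORLD_SIZE=4 ...).
            world_size = int(os.environ.get("WORLD_SIZE", "1"))
            if world_size != n:
                raise unittest.SkipTest(
                    f"test hard-codes ranks 0..{n - 1} and needs exactly {n} "
                    f"ranks, got WORLD_SIZE={world_size}"
                )
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

With such a decorator, a test that hard-codes ranks 0-3 would be skipped rather than fail on an 8-GPU B200 runner.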

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ptrblck @eqy @tinglvv @huydhn @atalman @ZainRizvi @malfet

@nWEIdia nWEIdia requested a review from a team as a code owner July 29, 2025 01:06
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jul 29, 2025

pytorch-bot bot commented Jul 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159323

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Cancelled Jobs

As of commit d3af5a4 with merge base e11b1cd:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@HDCharles HDCharles added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 29, 2025
@nWEIdia nWEIdia force-pushed the main-enable-b200-distributed-tests branch from dbb83c8 to 37d08be Compare August 2, 2025 15:53
@nWEIdia nWEIdia changed the title [Draft][CI][CUDA] Add b200 distributed job [CI][CUDA] Add b200 distributed job Aug 2, 2025
@nWEIdia nWEIdia mentioned this pull request Aug 4, 2025
@nWEIdia nWEIdia added the ciflow/periodic Trigger jobs run periodically on master (periodic.yml) on the PR label Aug 5, 2025
@nWEIdia nWEIdia changed the title [CI][CUDA] Add b200 distributed job [CI][CUDA] Add periodic b200 distributed job Aug 5, 2025
For example, an 8-GPU B200 runner makes the unit test run with world_size = 8.
Tested with:
TEMP_DIR=/tmp BACKEND=nccl WORLD_SIZE=4 pytest -v test/distributed/test_distributed_spawn.py -k test_new_subgroups_world_size_not_divisible_by_group_size

Test with B200 CI whether it fixes the failure in https://github.com/pytorch/pytorch/actions/runs/16757750565/job/47447831786
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Aug 6, 2025
@nWEIdia
Collaborator Author

nWEIdia commented Aug 6, 2025

Linking #158695, which addresses the following failure also seen on B200: https://github.com/pytorch/pytorch/actions/runs/16757750565/job/47447831792

AttributeError: '_MeshEnv' object has no attribute 'create_child_mesh'. Did you mean: 'create_sub_mesh'?
To execute this test, run the following from the base repo dir:
python test/distributed/checkpoint/e2e/test_fsdp_ep.py TestFSDPWithEP.test_e2e
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0.
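
The error appears to come from an API rename in torch.distributed.device_mesh (the removed create_child_mesh vs. the current create_sub_mesh). A quick probe like the one below, which assumes only that the module-level _mesh_resources object named in the error exists, shows which method a given build provides:

```python
# Compatibility probe: report which of the two method names this torch build
# exposes on the _MeshEnv instance mentioned in the AttributeError above.
from torch.distributed.device_mesh import _mesh_resources

for name in ("create_child_mesh", "create_sub_mesh"):
    status = "present" if hasattr(_mesh_resources, name) else "missing"
    print(f"{name}: {status}")
```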

@nWEIdia nWEIdia added the keep-going Don't stop on first failure, keep running tests until the end label Aug 6, 2025
@nWEIdia
Collaborator Author

nWEIdia commented Aug 7, 2025

The python test/distributed/test_distributed_spawn.py TestDistBackendWithSpawn.test_3_level_hierarchical_model_averager failure in https://github.com/pytorch/pytorch/actions/runs/16764647752/job/47468897559 happens because:

The test originally expected len(period_group_size_dict) groups, but that's incorrect: the actual count depends on how many subgroups are created for each group size.

  • For world_size=4: subgroup_size=2 creates 2 groups and subgroup_size=4 creates 0 new groups, so there are 2 subgroups and 3 process groups in total.
  • For world_size=8: subgroup_size=2 creates 4 groups and subgroup_size=4 creates 2 groups, so there are 6 subgroups and 7 process groups in total.

See the counting sketch below.
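
A hypothetical helper that mirrors the counting described above (the function and the example period keys are placeholders for illustration, not code from the test):

```python
# Mirrors the counting above: a subgroup size equal to world_size creates no
# new group, a smaller size creates world_size // size groups, and the default
# (world) process group adds one more.
def expected_pg_count(world_size, period_group_size_dict):
    new_groups = sum(
        world_size // size
        for size in period_group_size_dict.values()
        if size < world_size
    )
    return new_groups + 1  # plus the default process group


assert expected_pg_count(4, {2: 2, 4: 4}) == 3  # 2 subgroups + world group
assert expected_pg_count(8, {2: 2, 4: 4}) == 7  # 4 + 2 subgroups + world group
```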

@huydhn
Contributor

huydhn commented Aug 8, 2025

Looking at the numbers from https://github.com/pytorch/pytorch/actions/runs/16814299990/job/47629596189, I think we need to consider running this even less frequently, maybe nightly. My reasoning: we have 3 8xB200 runners here, so the total bank we have per day is 3 * 24 = 72 hours. Each distributed shard from that run seems to take close to 4 hours (I'm not sure why they take that long to finish), and there are 3 of them set to run every 4 hours, so 3 shards * 4 hours * 6 runs will eat up all the capacity. I'm proposing that we do the same as the H100 distributed job and start with some smoke tests first: https://github.com/pytorch/pytorch/blob/main/.github/workflows/h100-distributed.yml#L46.

If we take a rough approach of dividing the 72 hours into 4 groups for PyTorch core, compiler, distributed, and vLLM, distributed jobs shouldn't take more than 25%, or 18 hours, per day.
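
For reference, a back-of-the-envelope sketch of that budget (the numbers are taken from the comment above; nothing here reflects an actual workflow configuration):

```python
# Daily B200 capacity vs. the current periodic distributed schedule.
runners = 3
capacity_hours = runners * 24               # 72 runner-hours per day

shards = 3
hours_per_shard = 4                         # observed ~4 h per distributed shard
runs_per_day = 24 // 4                      # a run every 4 hours -> 6 runs
usage_hours = shards * hours_per_shard * runs_per_day   # 3 * 4 * 6 = 72

distributed_share = 0.25                    # rough split with core/compiler/vLLM
distributed_budget = capacity_hours * distributed_share  # 18 hours per day

print(f"used {usage_hours} h/day vs. budget {distributed_budget:.0f} h/day")
```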

@nWEIdia
Collaborator Author

nWEIdia commented Aug 8, 2025

I think we need to consider running this even less frequently, maybe nightly; each distributed shard takes close to 4 hours and the 3 shards run every 4 hours, which eats up all 72 hours of 8xB200 capacity per day. I'm proposing that we do the same as the H100 distributed job and run just some smoke tests: https://github.com/pytorch/pytorch/blob/main/.github/workflows/h100-distributed.yml#L46

Good observation!
I feel bad spending time here only to fix issues that could also be exposed by running with, e.g., A100 GPUs (or perhaps even CPU multi-node jobs). A subset of the distributed unit tests passes with a world_size of 4 but not with a world_size of 8.
So perhaps I should withdraw this PR and let the B200 nodes really work on benchmark jobs.

But benchmarks are using linux.dgx.b200, so which job would be using linux.dgx.b200.8?

@huydhn
Contributor

huydhn commented Aug 8, 2025

But benchmarks are using linux.dgx.b200, so which job would be using linux.dgx.b200.8?

  • Running distributed jobs on linux.dgx.b200.8 is a legit use case; we can loop in the team to see which tests they want to cover there first instead of running everything (cc @kwen2501)
  • The other user of linux.dgx.b200.8 is the vLLM benchmark, which needs all 8 GPUs to cover large models like DeepSeek V3

It also seems that there is a legit use case for 4xB200, so I will take one 8xB200 and turn it into two 4xB200. We can be flexible here.

@nWEIdia
Collaborator Author

nWEIdia commented Aug 8, 2025

Yes, I think using 4xB200 may make a whole bunch of failing unit tests pass automatically.
Running nightly with just linux.dgx.b200.4 seems better than running periodically with linux.dgx.b200.8.

nWEIdia added 2 commits August 7, 2025 18:36
quite long to finish all the tests (easily 4 hours+ for each of the 3 shards).
@nWEIdia nWEIdia removed the ciflow/periodic Trigger jobs run periodically on master (periodic.yml) on the PR label Aug 8, 2025
Labels
  • keep-going: Don't stop on first failure, keep running tests until the end
  • oncall: distributed: Add this issue/PR to distributed oncall triage queue
  • open source
  • topic: not user facing: topic category
  • triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module