
Stop parsing command line arguments every time common_utils is imported. #156703


Open
wants to merge 25 commits into base: main

Conversation

AnthonyBarbier
Collaborator

@AnthonyBarbier AnthonyBarbier commented Jun 24, 2025

@AnthonyBarbier AnthonyBarbier requested a review from a team as a code owner June 24, 2025 14:24

pytorch-bot bot commented Jun 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156703

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit 0663f08 with merge base ecea811 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/h100-distributed and oncall: distributed (add this issue/PR to distributed oncall triage queue) labels Jun 26, 2025
@soulitzer soulitzer added the triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Jul 7, 2025
@AnthonyBarbier
Collaborator Author

@H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k --> This is now ready for review.

It just moves the argparse command line parsing out of the module-level code of common_utils and into a function which needs to be called by the tests from their __main__ block.

Unfortunately, a lot of tests (mostly the distributed ones) relied on side effects of that code, such as the seed in worker processes being set to the default value. (Note: the seed set on the command line was never passed to the child processes, so it always used the default value.)
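Roughly, the shape of the change looks like this (a minimal sketch, not the actual diff; the real run_tests() does far more than shown here):

```python
# torch/testing/_internal/common_utils.py -- simplified sketch, not the real module
import argparse
import sys
import unittest

SEED = None  # no longer initialised at import time


def parse_cmd_line_args():
    """Parse the test-harness arguments on demand instead of on import."""
    global SEED
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--seed", type=int, default=1234)
    args, _ = parser.parse_known_args()
    SEED = args.seed


def run_tests():
    parse_cmd_line_args()  # parsing now happens here, not on `import common_utils`
    unittest.main(argv=[sys.argv[0]])  # stand-in: the real runner strips harness args


# A test module then opts in from its __main__ block:
#
#   from torch.testing._internal.common_utils import run_tests
#
#   if __name__ == "__main__":
#       run_tests()
```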

@clee2000 clee2000 added the ci-no-td (do not run TD on this PR) label Jul 25, 2025
Contributor

@clee2000 clee2000 left a comment


Looks fine; I sanity checked that the number of tests seems to stay the same. Not really related, but I wonder how necessary it is for some of the globals to actually be globals.

@AnthonyBarbier
Collaborator Author

Looks fine; I sanity checked that the number of tests seems to stay the same. Not really related, but I wonder how necessary it is for some of the globals to actually be globals.

Yes, I totally agree: some of the tests (especially the distributed ones) rely on global state being implicitly set in all workers, but I didn't have access to machines to run those tests on GPUs, so I decided to play it safe and not refactor them too much 😅

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 1, 2, linux.rocm.gpu.2)

Details for Dev Infra team Raised by workflow job

@AnthonyBarbier
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@izaitsevfb
Contributor

@pytorchbot revert -m "breaking tests internally with assert common_utils.SEED is not None" -c ghfirst

Breaking lots of tests (including the distributed module) internally, with the same issue: "assert common_utils.SEED is not None".

test_create_sharded_tensor_with_ones (test_sharded_tensor.TestShardedTensorChunked)
Test sharded_tensor.ones(...) ... /data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/common_utils.py:2450: UserWarning: set_rng_seed() was called without providing a seed and the command line arguments haven't been parsed so the seed will be set to 1234. To remove this warning make sure your test is run via run_tests() or parse_cmd_line_args() is called before set_rng_seed() is called.
  warnings.warn(msg)
FAIL

======================================================================
FAIL: test_create_sharded_tensor_with_ones (test_sharded_tensor.TestShardedTensorChunked)
Test sharded_tensor.ones(...)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 71, in setUp
    self._spawn_processes()
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/common_distributed.py", line 749, in _spawn_processes
    self._start_processes(proc)
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/common_distributed.py", line 720, in _start_processes
    assert common_utils.SEED is not None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

Probably parse_cmd_line_args() needs to be added to the relevant test suites / frameworks.

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Aug 4, 2025
…s imported. (#156703)"

This reverts commit 310f901.

Reverted #156703 on behalf of https://github.com/izaitsevfb due to breaking tests internally with `assert common_utils.SEED is not None` ([comment](#156703 (comment)))
@pytorchmergebot
Collaborator

@AnthonyBarbier your PR has been successfully reverted.

@AnthonyBarbier
Collaborator Author

@pytorchbot revert -m "breaking tests internally with assert common_utils.SEED is not None" -c ghfirst

Breaking lots of tests (including the distributed module) internally, with the same issue: "assert common_utils.SEED is not None".

@izaitsevfb What do you mean by "internally"? What am I supposed to do about it?

@izaitsevfb
Contributor

@izaitsevfb What do you mean by "internally"? What am I supposed to do about it?

I mean that the issue was spotted when importing the change into Meta's codebase. However, the failing distributed tests are open source; to name a few:

  • FAIL: test_unshard_async (fsdp.test_fully_shard_comm.TestFullyShardUnshardMultiProcess)
  • FAIL: test_train_mixed_requires_grad_per_group (fsdp.test_fully_shard_frozen.TestFullyShardFrozen)
  • FAIL: test_bf16_hook_has_wrapping_False_sharding_strategy2 (test_fsdp_comm_hooks.TestCommunicationHooks)

I'm not sure why this issue wasn't surfaced in OSS (do we run distributed tests?). The fix might be as simple as adding parse_cmd_line_args() here:

def setUp(self) -> None:

I'd suggest trying to repro the failure and, once you have a fix, asking @clee2000 or myself to re-import this PR internally again to double-check that there are no more regressions.
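For concreteness, a hedged sketch of that suggestion (the class name is a placeholder; only parse_cmd_line_args() is the new entry point from this PR):

```python
from torch.testing._internal import common_utils
from torch.testing._internal.common_distributed import MultiProcessTestCase


class SomeShardedTensorTest(MultiProcessTestCase):  # placeholder name
    def setUp(self) -> None:
        # Initialise the common_utils globals (SEED, etc.) even when the test
        # is launched by a runner that bypasses run_tests() / the __main__ block.
        common_utils.parse_cmd_line_args()
        super().setUp()
        self._spawn_processes()
```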

@AnthonyBarbier
Collaborator Author

@izaitsevfb @clee2000 Could you please provide more details about the command lines used to run those tests and share the logs? (I don't have access to an environment with multiple GPUs to reproduce the issue, and I haven't managed to hit it by hacking the tests to run on CPU.)

The fix might be as simple as adding parse_cmd_line_args() here:

I disagree, the correct sequence should be:

  • Declare test suites / test cases
  • Parse command line arguments / set globals
  • Instantiate tests
  • Run tests

The problem this PR is trying to fix is that test suites can't be imported into larger test suites, because simply importing their module triggers command line argument parsing, which might be completely irrelevant or different in that new context.

As part of moving the command line parsing into the run_tests() function, it became clear that a large number of tests actually rely on global settings to declare their tests (i.e. all the modules which have some kind of x = torch.randn() at their top level implicitly rely on the seed being set, otherwise they won't be reproducible), so as a workaround I allowed some test suites to call parse_cmd_line_args() before declaring their tests. It's dodgy, but still a bit better.
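A sketch of that workaround pattern (illustrative test content; whether the seed is applied inside parse_cmd_line_args() or in a separate step is glossed over here):

```python
import torch
from torch.testing._internal.common_utils import (
    TestCase,
    parse_cmd_line_args,
    run_tests,
)

# Workaround: initialise the harness globals (seed, etc.) *before* declaring
# the tests, because the module-level tensor below depends on the RNG state.
parse_cmd_line_args()

x = torch.randn(8, 8)  # only reproducible if the seed was already set above


class TestSomething(TestCase):
    def test_shape(self):
        self.assertEqual(x.shape, (8, 8))


if __name__ == "__main__":
    run_tests()
```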

Now the distributed tests are a different level of dodgy 😅: they parse the command line in the parent process, then create a pool of processes which don't have access to the original command line arguments, so every child implicitly initialises all the globals to their default values on import common_utils, and a lot of tests rely on this pattern of a whole bunch of globals being implicitly initialised to the same value in all the workers.
Anyway, in this case I did try to fix things by passing the seed from the parent to the children, so that the command line options actually do something now.
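The parent-to-child seed handoff could look roughly like this (a standalone sketch using torch.multiprocessing directly; the real plumbing lives in common_distributed.py and is more involved):

```python
import torch
import torch.multiprocessing as mp


def _worker(rank: int, seed: int) -> None:
    # The child process never saw the parent's command line, so the parsed
    # seed has to be handed over explicitly instead of re-defaulting on import.
    torch.manual_seed(seed)
    print(f"rank {rank}: {torch.randn(1).item():.4f}")


def main() -> None:
    seed = 1234  # in the real harness this would come from parse_cmd_line_args()
    mp.spawn(_worker, args=(seed,), nprocs=2, join=True)


if __name__ == "__main__":
    main()
```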

Traceback (most recent call last):
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 71, in setUp
    self._spawn_processes()
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/common_distributed.py", line 749, in _spawn_processes
    self._start_processes(proc)
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/common_distributed.py", line 720, in _start_processes
    assert common_utils.SEED is not None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

Now, this stacktrace really confuses me because it suggests that the process pool is either created in the main process before the command line arguments are parsed, or that the test runs in a subprocess which itself tries to create a new process pool.

AFAICT this test comes from this file, which does have a run_tests() at the bottom? So is there something special in the way these tests are instantiated / run?

@clee2000
Contributor

clee2000 commented Aug 5, 2025

The internal test runner is pretty different; I don't think it gives command line arguments at all or uses the normal main block, so you might have to provide a default value for SEED and some other vars.

@AnthonyBarbier
Collaborator Author

The internal test runner is pretty different; I don't think it gives command line arguments at all or uses the normal main block, so you might have to provide a default value for SEED and some other vars.

This assert does exactly what it's supposed to do: it catches the fact that you're trying to use an uninitialised global variable (which wouldn't happen if you weren't bypassing the main!).

If I remove this assert / set a global default value, we go back to the original issue: there is no way to catch a test which forgets to parse the command line arguments, so they just get ignored.

At that point we might as well just remove the "--seed" argument to be honest, which I guess is the other option?

If you want to bypass the main in your infrastructure, then you need to get your test runner either to call parse_cmd_line_args() once to initialise the globals before you start running things, or to set the globals manually: it's probably mostly just SEED and GRAPH_EXECUTOR which are used by tests, I think.
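A runner-side bootstrap could be as small as the following (a sketch assuming a pytest-based runner; pytest_configure is the standard pytest hook, everything else here is an assumption):

```python
# conftest.py for a test runner that bypasses run_tests() -- hedged sketch
from torch.testing._internal import common_utils


def pytest_configure(config):
    # Option 1: let the harness parse whatever command line it was given
    # (possibly nothing), which initialises SEED and the other globals.
    common_utils.parse_cmd_line_args()

    # Option 2 (alternative): set the globals by hand, e.g.
    # common_utils.SEED = 1234
```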

@clee2000
Contributor

clee2000 commented Aug 6, 2025

Hm, is it necessary to parse command line vars? I don't think you're usually supposed to run distributed tests using pytest directly, but that does work with most other tests, and pytest wouldn't run the main block either. I don't know a lot about the internal test runner, but I'm pretty sure it uses pytest; I don't know the details about test running and discovery.

@AnthonyBarbier
Collaborator Author

Hm, is it necessary to parse command line vars? I don't think you're usually supposed to run distributed tests using pytest directly, but that does work with most other tests, and pytest wouldn't run the main block either. I don't know a lot about the internal test runner, but I'm pretty sure it uses pytest; I don't know the details about test running and discovery.

I don't think it's specific to pytest; it's just about running the test from the command line in general.

Anyway, if we think it's a torch.distributed thing, then I could hardcode a seed for distributed tests only and print a warning if the user tries to set a different one from the command line, indicating that it's not supported and will have no effect.
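That alternative might look something like this (purely illustrative; the helper name, the default value, and the warning text are all assumptions, not anything in this PR):

```python
import warnings

_DISTRIBUTED_DEFAULT_SEED = 1234  # hypothetical hardcoded value


def resolve_distributed_seed(cli_seed=None):
    """Hypothetical helper: distributed tests ignore a user-supplied seed."""
    if cli_seed is not None and cli_seed != _DISTRIBUTED_DEFAULT_SEED:
        warnings.warn(
            "--seed is not supported for distributed tests and will have no "
            f"effect; using {_DISTRIBUTED_DEFAULT_SEED} instead."
        )
    return _DISTRIBUTED_DEFAULT_SEED
```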

@nWEIdia
Collaborator

nWEIdia commented Aug 8, 2025

Adding ciflow/periodic label to trigger distributed signals.

@nWEIdia nWEIdia added the ciflow/periodic (trigger jobs run periodically on master, periodic.yml) label Aug 8, 2025
Labels: ci-no-td, ciflow/h100-distributed, ciflow/periodic, ciflow/trunk, Merged, oncall: distributed, open source, Reverted, topic: not user facing, triaged