
Stop parsing command line arguments every time common_utils is imported. #156703


Open
wants to merge 25 commits into base: main

Conversation

AnthonyBarbier
Collaborator

@AnthonyBarbier AnthonyBarbier commented Jun 24, 2025

@AnthonyBarbier AnthonyBarbier requested a review from a team as a code owner June 24, 2025 14:24

pytorch-bot bot commented Jun 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156703

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit 0663f08 with merge base ecea811 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/h100-distributed and oncall: distributed (add this issue/PR to distributed oncall triage queue) labels Jun 26, 2025
@soulitzer soulitzer added the triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Jul 7, 2025
@AnthonyBarbier
Collaborator Author

@H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k --> This is now ready for review.

It just moves the argparse command line parsing out of the module-level code of common_utils and into a function which needs to be called by the tests from their __main__ block.

Unfortunately, a lot of tests (mostly the distributed ones) relied on side effects of that code, such as the seed in worker processes being set to the default value. (Note: the seed set on the command line was never passed to the child processes, so it always used the default value.)
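Roughly, the shape of the change looks like this (a minimal sketch, not the actual diff; the real run_tests() does far more than shown here):

```python
# torch/testing/_internal/common_utils.py -- simplified sketch, not the real module
import argparse
import sys
import unittest

SEED = None  # no longer initialised at import time


def parse_cmd_line_args():
    """Parse the test-harness arguments on demand instead of on import."""
    global SEED
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--seed", type=int, default=1234)
    args, _ = parser.parse_known_args()
    SEED = args.seed


def run_tests():
    parse_cmd_line_args()  # parsing now happens here, not on `import common_utils`
    unittest.main(argv=[sys.argv[0]])  # stand-in: the real runner strips harness args


# A test module then opts in from its __main__ block:
#
#   from torch.testing._internal.common_utils import run_tests
#
#   if __name__ == "__main__":
#       run_tests()
```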

@clee2000 clee2000 added the ci-no-td (do not run TD on this PR) label Jul 25, 2025
Contributor

@clee2000 clee2000 left a comment


Looks fine; I sanity checked that the number of tests seems to stay the same. Not really related, but I wonder how necessary it is for some of the globals to actually be globals.

@AnthonyBarbier
Collaborator Author

Looks fine; I sanity checked that the number of tests seems to stay the same. Not really related, but I wonder how necessary it is for some of the globals to actually be globals.

Yes, I totally agree: some of the tests (especially the distributed ones) rely on global state being implicitly set in all workers, but I didn't have access to machines to run those tests on GPUs, so I decided to play it safe and not refactor them too much 😅

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 1, 2, linux.rocm.gpu.2)

Details for Dev Infra team Raised by workflow job

@AnthonyBarbier
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@izaitsevfb
Contributor

@pytorchbot revert -m "breaking tests internally with assert common_utils.SEED is not None" -c ghfirst

Breaking lots of tests (including the distributed module) internally, with the same issue: "assert common_utils.SEED is not None".

test_create_sharded_tensor_with_ones (test_sharded_tensor.TestShardedTensorChunked)
Test sharded_tensor.ones(...) ... /data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/common_utils.py:2450: UserWarning: set_rng_seed() was called without providing a seed and the command line arguments haven't been parsed so the seed will be set to 1234. To remove this warning make sure your test is run via run_tests() or parse_cmd_line_args() is called before set_rng_seed() is called.
  warnings.warn(msg)
FAIL

======================================================================
FAIL: test_create_sharded_tensor_with_ones (test_sharded_tensor.TestShardedTensorChunked)
Test sharded_tensor.ones(...)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 71, in setUp
    self._spawn_processes()
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/common_distributed.py", line 749, in _spawn_processes
    self._start_processes(proc)
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/common_distributed.py", line 720, in _start_processes
    assert common_utils.SEED is not None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

Probably parse_cmd_line_args() needs to be added to the relevant test suites / frameworks.

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Aug 4, 2025
…s imported. (#156703)"

This reverts commit 310f901.

Reverted #156703 on behalf of https://github.com/izaitsevfb due to breaking tests internally with `assert common_utils.SEED is not None` ([comment](#156703 (comment)))
@pytorchmergebot
Collaborator

@AnthonyBarbier your PR has been successfully reverted.

@AnthonyBarbier
Collaborator Author

@pytorchbot revert -m "breaking tests internally with assert common_utils.SEED is not None" -c ghfirst

Breaking lots of tests (including the distributed module) internally, with the same issue: "assert common_utils.SEED is not None".

@izaitsevfb What do you mean by "internally"? What am I supposed to do about it?

@izaitsevfb
Contributor

@izaitsevfb What do you mean by "internally"? What am I supposed to do about it?

I mean that the issue was spotted when importing the change into Meta's codebase. However, the failing distributed tests are open source; to name a few:

  • FAIL: test_unshard_async (fsdp.test_fully_shard_comm.TestFullyShardUnshardMultiProcess)
  • FAIL: test_train_mixed_requires_grad_per_group (fsdp.test_fully_shard_frozen.TestFullyShardFrozen)
  • FAIL: test_bf16_hook_has_wrapping_False_sharding_strategy2 (test_fsdp_comm_hooks.TestCommunicationHooks)

I'm not sure why this issue wasn't surfaced in OSS (do we run distributed tests?). The fix might be as simple as adding parse_cmd_line_args() here:

def setUp(self) -> None:

I'd suggest trying to repro the failure and, once you have a fix, asking @clee2000 or myself to re-import this PR internally again to double-check that there are no more regressions.
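For concreteness, a hedged sketch of that suggestion (the class name is a placeholder; only parse_cmd_line_args() is the new entry point from this PR):

```python
from torch.testing._internal import common_utils
from torch.testing._internal.common_distributed import MultiProcessTestCase


class SomeShardedTensorTest(MultiProcessTestCase):  # placeholder name
    def setUp(self) -> None:
        # Initialise the common_utils globals (SEED, etc.) even when the test
        # is launched by a runner that bypasses run_tests() / the __main__ block.
        common_utils.parse_cmd_line_args()
        super().setUp()
        self._spawn_processes()
```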

@AnthonyBarbier
Collaborator Author

@izaitsevfb @clee2000 Could you please provide more details about the command lines used to run those tests and share the logs? (I don't have access to an environment with multiple GPUs to reproduce the issue, and I haven't managed to hit it by hacking the tests to run on CPU.)

The fix might be as simple as adding parse_cmd_line_args() here:

I disagree, the correct sequence should be:

  • Declare test suites / test cases
  • Parse command line arguments / set globals
  • Instantiate tests
  • Run tests

The problem this PR is trying to fix is that test suites can't be imported into larger test suites, because simply importing their module triggers command line argument parsing, which might be completely irrelevant or different in that new context.

As part of moving the command line parsing into the run_tests() function, it became clear that a large number of tests actually rely on global settings to declare their tests (i.e. all the modules which have some kind of x = torch.randn() at their top level implicitly rely on the seed being set, otherwise they won't be reproducible), so as a workaround I allowed some test suites to call parse_cmd_line_args() before declaring their tests. It's dodgy, but still a bit better.
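A sketch of that workaround pattern (illustrative test content; whether the seed is applied inside parse_cmd_line_args() or in a separate step is glossed over here):

```python
import torch
from torch.testing._internal.common_utils import (
    TestCase,
    parse_cmd_line_args,
    run_tests,
)

# Workaround: initialise the harness globals (seed, etc.) *before* declaring
# the tests, because the module-level tensor below depends on the RNG state.
parse_cmd_line_args()

x = torch.randn(8, 8)  # only reproducible if the seed was already set above


class TestSomething(TestCase):
    def test_shape(self):
        self.assertEqual(x.shape, (8, 8))


if __name__ == "__main__":
    run_tests()
```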

Now the distributed tests are a different level of dodgy 😅: they parse the command line in the parent process, then create a pool of processes which don't have access to the original command line arguments, so every child implicitly initialises all the globals to their default values on import common_utils, and a lot of tests rely on this pattern of a whole bunch of globals being implicitly initialised to the same value in all the workers.
Anyway, in this case I did try to fix things by passing the seed from the parent to the children, so that the command line options actually do something now.
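The parent-to-child seed handoff could look roughly like this (a standalone sketch using torch.multiprocessing directly; the real plumbing lives in common_distributed.py and is more involved):

```python
import torch
import torch.multiprocessing as mp


def _worker(rank: int, seed: int) -> None:
    # The child process never saw the parent's command line, so the parsed
    # seed has to be handed over explicitly instead of re-defaulting on import.
    torch.manual_seed(seed)
    print(f"rank {rank}: {torch.randn(1).item():.4f}")


def main() -> None:
    seed = 1234  # in the real harness this would come from parse_cmd_line_args()
    mp.spawn(_worker, args=(seed,), nprocs=2, join=True)


if __name__ == "__main__":
    main()
```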

Traceback (most recent call last):
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 71, in setUp
    self._spawn_processes()
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/common_distributed.py", line 749, in _spawn_processes
    self._start_processes(proc)
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/c0c096af71f133c3/caffe2/test/distributed/_shard/sharded_tensor/__sharded_tensor__/sharded_tensor#link-tree/torch/testing/_internal/common_distributed.py", line 720, in _start_processes
    assert common_utils.SEED is not None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

Now, this stacktrace really confuses me because it suggests that the process pool is either created in the main process before the command line arguments are parsed, or that the test runs in a subprocess which itself tries to create a new process pool.

AFAICT this test comes from this file, which does have a run_tests() at the bottom? So is there something special in the way these tests are instantiated / run?

@clee2000
Contributor

clee2000 commented Aug 5, 2025

The internal test runner is pretty different; I don't think it gives command line arguments at all or uses the normal main block, so you might have to provide a default value for SEED and some other vars.

@AnthonyBarbier
Collaborator Author

The internal test runner is pretty different; I don't think it gives command line arguments at all or uses the normal main block, so you might have to provide a default value for SEED and some other vars.

This assert does exactly what it's supposed to do: it catches the fact that you're trying to use an uninitialised global variable (which wouldn't happen if you weren't bypassing the main!).

If I remove this assert / set a global default value, we go back to the original issue: there is no way to catch a test which forgets to parse the command line arguments, so they just get ignored.

At that point we might as well just remove the "--seed" argument to be honest, which I guess is the other option?

If you want to bypass the main in your infrastructure, then you need to get your test runner either to call parse_cmd_line_args() once to initialise the globals before you start running things, or to set the globals manually: it's probably mostly just SEED and GRAPH_EXECUTOR which are used by tests, I think.
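A runner-side bootstrap could be as small as the following (a sketch assuming a pytest-based runner; pytest_configure is the standard pytest hook, everything else here is an assumption):

```python
# conftest.py for a test runner that bypasses run_tests() -- hedged sketch
from torch.testing._internal import common_utils


def pytest_configure(config):
    # Option 1: let the harness parse whatever command line it was given
    # (possibly nothing), which initialises SEED and the other globals.
    common_utils.parse_cmd_line_args()

    # Option 2 (alternative): set the globals by hand, e.g.
    # common_utils.SEED = 1234
```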

@clee2000
Contributor

clee2000 commented Aug 6, 2025

Hm, is it necessary to parse command line vars? I don't think you're usually supposed to run distributed tests using pytest directly, but that does work with most other tests, and pytest wouldn't run the main block either. I don't know a lot about the internal test runner, but I'm pretty sure it uses pytest; I don't know the details about test running and discovery.

@AnthonyBarbier
Collaborator Author

Hm, is it necessary to parse command line vars? I don't think you're usually supposed to run distributed tests using pytest directly, but that does work with most other tests, and pytest wouldn't run the main block either. I don't know a lot about the internal test runner, but I'm pretty sure it uses pytest; I don't know the details about test running and discovery.

I don't think it's specific to pytest; it's just about running the test from the command line in general.

Anyway, if we think it's a torch.distributed thing, then I could hardcode a seed for distributed tests only and print a warning if the user tries to set a different one from the command line, indicating that it's not supported and will have no effect.
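That alternative might look something like this (purely illustrative; the helper name, the default value, and the warning text are all assumptions, not anything in this PR):

```python
import warnings

_DISTRIBUTED_DEFAULT_SEED = 1234  # hypothetical hardcoded value


def resolve_distributed_seed(cli_seed=None):
    """Hypothetical helper: distributed tests ignore a user-supplied seed."""
    if cli_seed is not None and cli_seed != _DISTRIBUTED_DEFAULT_SEED:
        warnings.warn(
            "--seed is not supported for distributed tests and will have no "
            f"effect; using {_DISTRIBUTED_DEFAULT_SEED} instead."
        )
    return _DISTRIBUTED_DEFAULT_SEED
```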

@nWEIdia
Collaborator

nWEIdia commented Aug 8, 2025

Adding ciflow/periodic label to trigger distributed signals.

@nWEIdia nWEIdia added the ciflow/periodic (trigger jobs run periodically on master, periodic.yml) label Aug 8, 2025
Labels: ci-no-td, ciflow/h100-distributed, ciflow/periodic, ciflow/trunk, Merged, oncall: distributed, open source, Reverted, topic: not user facing, triaged