Stop parsing command line arguments every time common_utils is imported. #156703
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156703
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Unrelated Failure as of commit 0663f08 with merge base ecea811.
NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k --> This is now ready for review. It just moves the argparse command line parsing out of the root code of `common_utils`, so that importing the module no longer has side effects. Unfortunately a lot of tests (mostly the distributed ones) relied on side effects from that code, for example setting the seed in worker processes to the default seed (note: the seed set on the command line was never passed to the child processes, so it always used the default value).
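For context, here is a minimal sketch of the pattern being described (the names `parse_cmd_line_args`, `SEED`, and the flag default are illustrative assumptions, not necessarily the PR's exact code): parsing moves from import time into a function that the main block invokes explicitly.

```python
# Sketch only: parsing on demand instead of as an import side effect.
import argparse

SEED = None  # stays uninitialised until the arguments are actually parsed


def parse_cmd_line_args(argv=None):
    """Parse harness flags explicitly rather than at module import time."""
    global SEED
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--seed", type=int, default=1234)
    args, _ = parser.parse_known_args(argv)
    SEED = args.seed


def run_tests():
    # Entry point that tests call from their `if __name__ == "__main__":`
    # block, so `import common_utils` itself no longer touches sys.argv.
    parse_cmd_line_args()
    # ... discover and run the tests ...
```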
Looks fine; I sanity checked that the number of tests seems to stay the same. Not really related, but I wonder how necessary it is for some of the globals to actually be globals.
Yes, I totally agree: some of the tests (especially the distributed ones) rely on global state being implicitly set in all workers, but I didn't have access to machines to run those tests on GPUs, so I decided to play it safe and not refactor them too much 😅
Merge failed. Reason: 1 job has failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 1, 2, linux.rocm.gpu.2). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "breaking tests internally"
Breaking lots of tests (including the distributed module) internally, all with the same issue: `assert common_utils.SEED is not None`.
@pytorchbot successfully started a revert job. Check the current status here.
Revert "Stop parsing command line arguments every time common_utils is imported. (#156703)". This reverts commit 310f901. Reverted #156703 on behalf of https://github.com/izaitsevfb due to breaking tests internally with `assert common_utils.SEED is not None` ([comment](#156703 (comment))).
@AnthonyBarbier your PR has been successfully reverted.
@izaitsevfb What do you mean by "internally"? What am I supposed to do about it?
I mean that the issue was spotted when importing the change into Meta's codebase. However, the affected distributed tests are open source; to name a few:
I'm not sure why this issue wasn't surfaced in OSS (do we run distributed tests?). The fix might be as simple as adding a default value for `SEED`.
I'd suggest trying to repro the failure, and after the fix, asking @clee2000 or myself to re-import this PR internally to double check that there are no more regressions.
@izaitsevfb @clee2000 Could you please provide more details about the command lines used to run those tests and share the logs? (I don't have access to an environment with multiple GPUs to reproduce the issue, and I haven't managed to hit it by hacking the tests to run on CPU.)
I disagree; the correct sequence should be:
The problem this PR is trying to fix is that test suites can't be imported into larger test suites, because simply importing their module triggers parsing of command line arguments which might be completely irrelevant or different in that new context. As part of moving the command line parsing out of the module's root code and into a function called from the main block, an assert was added so that reading an uninitialised global like `SEED` fails loudly instead of silently falling back to a default.

Now the distributed tests are a different level of dodgy 😅: they parse the command line in the parent process, then create a pool of processes which do not have access to the original command line arguments, so all of the children's globals are implicitly initialised to their default values on import. The sketch below illustrates this.
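A self-contained sketch of that worker behaviour, assuming the "spawn" start method (which `torch.multiprocessing.spawn` uses): the child re-imports the module and never runs the parent's main block, so it only ever sees the module-level default.

```python
import torch.multiprocessing as mp

SEED = None  # in the parent this gets overwritten after argparse runs


def worker(rank):
    # A spawned child re-imports this module and does not execute the main
    # block below, so SEED is back to None here regardless of --seed.
    print(f"rank {rank}: SEED={SEED}")


if __name__ == "__main__":
    SEED = 42  # pretend this came from the parent's command line parsing
    mp.spawn(worker, nprocs=2)  # both workers print SEED=None, not 42
```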
Now, this stacktrace really confuses me, because it suggests the process pool is either created in the main process before the command line arguments are parsed, or the test runs in a subprocess which itself tries to create a new process pool. AFAICT this test comes from this file, which does have a main block.
The internal test runner is pretty different: I don't think it gives command line arguments at all or uses the normal main block, so you might have to provide a default value for `SEED` and some other vars.
This assert does exactly what it's supposed to do: it catches the fact that you're trying to use an uninitialised global variable (which wouldn't happen if you were not bypassing the main block!). If I remove this assert or set a global default value, we're back to the original issue: there is no way to catch a test that forgets to parse the command line arguments, so they'll just get ignored. At that point we might as well just remove the `--seed` argument, to be honest, which I guess is the other option? If you want to bypass the main block in your infrastructure, then you need to get your test runner to either call the argument-parsing function itself or set the globals explicitly.
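As a minimal sketch of that contract (hypothetical names, mirroring the earlier sketch):

```python
# Hypothetical harness state, as in the earlier sketch:
SEED = None  # deliberately no default


def set_rng_seed():
    # The guard under discussion: fail loudly if nothing parsed the
    # arguments, rather than silently seeding with a default and ignoring
    # a user-supplied --seed.
    assert SEED is not None, "command line arguments were never parsed"


# What a runner that bypasses the main block must do explicitly:
SEED = 1234  # choose a seed (or call the parsing function, if exposed)
set_rng_seed()  # now the guard passes
```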
Hm, is it necessary to parse command line vars? I don't think you're usually supposed to run distributed tests using pytest directly, but it is something that works with most other tests, and pytest wouldn't run the main block either. I don't know a lot about the internal test runner, but I'm pretty sure it uses pytest; I don't know the details about test running and discovery.
I don't think it's specific to pytest; it's just running the test from the command line in general. Anyway, if we think it's a problem, I guess removing the `--seed` argument remains the other option.
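If the internal runner really is pytest-based, one hedged possibility (not something this PR provides) is a `conftest.py` hook that initialises the global before collection. `common_utils.SEED` is the global named in the revert message; the default value here is purely an assumption.

```python
# conftest.py -- sketch only; assumes the harness keeps SEED as a plain
# module global and that a fixed default is acceptable for such runs.
import torch.testing._internal.common_utils as common_utils


def pytest_configure(config):
    # Runs once before collection, standing in for the main block that a
    # pytest-driven run never executes.
    if getattr(common_utils, "SEED", None) is None:
        common_utils.SEED = 1234  # assumed default; the real one may differ
```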
Adding ciflow/periodic label to trigger distributed signals. |
Last PR in the series to re-submit #134592 as smaller PRs:
#154612
#154628
#154715
#154716
#154725
#154728
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta