-
-
Notifications
You must be signed in to change notification settings - Fork 7.8k
FIX: Move all tests using subprocess to the same worker on windows #29981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Well it passes here 🎉 Comparing with latest main (that failed), the same number of tests were run, and the test time was about a minute slower here, which all checks out. |
Trying to power-cycle, but apparently azure pipeline results are cached and not rerun 😞. |
Ok second try (only reworded commit message to let azure believe this is new). Let's see if we can get this passing multiple times. |
Second run also successful. Yet another rewording. Three's the charm. |
4 errors (two look like timeouts, one looks like a failure to read back a png and the last is some random test later when the warning about unclosed files got raised during GC and pytest failed the test it was in). |
One of the failing tests was not marked with the xdist_group. -> fix and retry. |
This is a somewhat wild guess based on matplotlib#29797 (comment) > which makes me think we are somehow crossing state what launching the subprocesses. Using the `xdist_group` with `--dist=loadgroup` should put all tests of that group to the same worker according to https://pytest-xdist.readthedocs.io/en/stable/distribution.html. I've only added `--dist=loadgroup` to the windows pipelines, so tests on other systems are not affected at all. The first test is to see whether this works in the PR. - But to be sure, it think we would need to put this on master and monitor whether the timeouts disappear.
First test run after fix successful. Force-push to re-run a second time. |
The annulus test that is failing is a bit pathological and there are two tests that share the same baseline image, I'm going to push a commit merging them. |
There is a lock for re-used filenames, but it is only used if you re-use a filename in the same test, so it is best to merge these, even if it would just be two figures in one test (which I don't think you did anyway.) |
lib/matplotlib/tests/test_patches.py
Outdated
@@ -825,8 +825,9 @@ def test_annulus(): | |||
ax.set_aspect('equal') | |||
|
|||
|
|||
@pytest.mark.parametrize('mode', ('a', 'b')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like a
is (setting by) axis
and b
is radius
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that is what it is doing. I went with the most obvious to me solution, making two figures with the same name is a better solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On third thought, I left it paramaterized, but gave better names (and re-ordered the code a bit to only have one if/elif).
It is a bit aggressive to backport this PR, if anyone has the slightest concern lets not. |
If we are correct that the problem is trying to launch simultaneous subprocesses from the pytest-xdist workers, do we have any theory as to why this is a problem? This seems like something that should be safe. My best guess (I did look at the source of |
I'm also not clear on the mechanics in the background, but the issues liked in #29797 (comment) indicate that there can be interplay between the subprocesses and timeout handling, and that could be specific to windows. It's a bit weird that it only(?) happens for Python 3.10/3.11. I'm inclined to limit the the |
The recent change showed the timeout again. Can somebody with more pipeline configuration skills please verify whether my attempt to apply |
I'm not sure that |
I still have an echo statement in azure-pipelines.yml to veryify the code works as intended. It does. I'm letting the CI complete to get another data point on "this does not time out". When that's complete, I'll remove the echo. |
Everything seems to work. Echo removed. |
Note: Some of them have been marked flaky - may be worth reconstructing whether that was due to timeouts.
Got a timeout, and found out there were more tests using subprocesses, which weren't yet in the xdist_group. Added them. |
Let's put this on hold in favor of #29992. |
This is a somewhat wild guess based on #29797 (comment)
Using the
xdist_group
with--dist=loadgroup
should put all tests of that group to the same worker according to https://pytest-xdist.readthedocs.io/en/stable/distribution.html. I've only added--dist=loadgroup
to the windows pipelines, so tests on other systems are not affected at all.The first step is to see whether this works in the PR. - But to be sure, it think we would need to put this on master and monitor whether the timeouts disappear.