Skip to content

FIX: Move all tests using subprocess to the same worker on windows #29981

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

timhoffm
Copy link
Member

@timhoffm timhoffm commented Apr 28, 2025

This is a somewhat wild guess based on #29797 (comment)

which makes me think we are somehow crossing state what launching the subprocesses.

Using the xdist_group with --dist=loadgroup should put all tests of that group to the same worker according to https://pytest-xdist.readthedocs.io/en/stable/distribution.html. I've only added --dist=loadgroup to the windows pipelines, so tests on other systems are not affected at all.

The first step is to see whether this works in the PR. - But to be sure, it think we would need to put this on master and monitor whether the timeouts disappear.

@dstansby
Copy link
Member

Well it passes here 🎉 Comparing with latest main (that failed), the same number of tests were run, and the test time was about a minute slower here, which all checks out.

@timhoffm timhoffm closed this Apr 28, 2025
@timhoffm timhoffm reopened this Apr 28, 2025
@timhoffm
Copy link
Member Author

Trying to power-cycle, but apparently azure pipeline results are cached and not rerun 😞.

@timhoffm
Copy link
Member Author

timhoffm commented Apr 28, 2025

Ok second try (only reworded commit message to let azure believe this is new). Let's see if we can get this passing multiple times.

@timhoffm
Copy link
Member Author

Second run also successful.

Yet another rewording. Three's the charm.

@tacaswell
Copy link
Member

4 errors (two look like timeouts, one looks like a failure to read back a png and the last is some random test later when the warning about unclosed files got raised during GC and pytest failed the test it was in).

@timhoffm
Copy link
Member Author

One of the failing tests was not marked with the xdist_group. -> fix and retry.

This is a somewhat wild guess based on
matplotlib#29797 (comment)

> which makes me think we are somehow crossing state what launching the subprocesses.

Using the `xdist_group` with `--dist=loadgroup` should put all tests of
that group to the same worker according to
https://pytest-xdist.readthedocs.io/en/stable/distribution.html.
I've only added `--dist=loadgroup` to the windows pipelines, so tests on
other systems are not affected at all.

The first test is to see whether this works in the PR. - But to be sure,
it think we would need to put this on master and monitor whether the
timeouts disappear.
@timhoffm
Copy link
Member Author

First test run after fix successful. Force-push to re-run a second time.

@tacaswell
Copy link
Member

The annulus test that is failing is a bit pathological and there are two tests that share the same baseline image, I'm going to push a commit merging them.

@QuLogic
Copy link
Member

QuLogic commented Apr 29, 2025

There is a lock for re-used filenames, but it is only used if you re-use a filename in the same test, so it is best to merge these, even if it would just be two figures in one test (which I don't think you did anyway.)

@@ -825,8 +825,9 @@ def test_annulus():
ax.set_aspect('equal')


@pytest.mark.parametrize('mode', ('a', 'b'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a is (setting by) axis and b is radius.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that is what it is doing. I went with the most obvious to me solution, making two figures with the same name is a better solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On third thought, I left it paramaterized, but gave better names (and re-ordered the code a bit to only have one if/elif).

@tacaswell tacaswell added this to the v3.10.2 milestone Apr 30, 2025
@tacaswell
Copy link
Member

It is a bit aggressive to backport this PR, if anyone has the slightest concern lets not.

@tacaswell
Copy link
Member

If we are correct that the problem is trying to launch simultaneous subprocesses from the pytest-xdist workers, do we have any theory as to why this is a problem? This seems like something that should be safe.

My best guess (I did look at the source of subprocess.py to try and verify this guess and was clear to me that would take more time than I can spend right now) is that the way xdist creates the additional workers the filehandles that are generated for the pipes are shared and hence we get the cross talk between the workers.

@timhoffm
Copy link
Member Author

I'm also not clear on the mechanics in the background, but the issues liked in #29797 (comment) indicate that there can be interplay between the subprocesses and timeout handling, and that could be specific to windows.

It's a bit weird that it only(?) happens for Python 3.10/3.11. I'm inclined to limit the the --distgroup parameter to only 3.11. First because we then still test a with unsynced subprocesses and if 3.12/3.13 exhibit the error we are more sure it's a discriminating factor. Second, if it does not happen in 3.12/3.13 anymore, we can phase the usage of distgroups out when dropping 3.11, and remove unnecessary overhead.

@timhoffm
Copy link
Member Author

The recent change showed the timeout again. Can somebody with more pipeline configuration skills please verify whether my attempt to apply --dist=loadgroup only to the py3.11 pipeline is valid?

@QuLogic
Copy link
Member

QuLogic commented Apr 30, 2025

I'm not sure that PYTHON_VERSION is actually set anywhere; it appears to be a leftover from when we did Pre-release testing. I think you need to use $(python.version) to fetch the runtime variable defined by the matrix.

@timhoffm
Copy link
Member Author

timhoffm commented May 1, 2025

I still have an echo statement in azure-pipelines.yml to veryify the code works as intended. It does. I'm letting the CI complete to get another data point on "this does not time out". When that's complete, I'll remove the echo.

@timhoffm
Copy link
Member Author

timhoffm commented May 1, 2025

Everything seems to work. Echo removed.

Note: Some of them have been marked flaky - may be worth reconstructing
whether that was due to timeouts.
@timhoffm
Copy link
Member Author

timhoffm commented May 1, 2025

Got a timeout, and found out there were more tests using subprocesses, which weren't yet in the xdist_group. Added them.

@timhoffm
Copy link
Member Author

timhoffm commented May 1, 2025

Let's put this on hold in favor of #29992.

@timhoffm timhoffm marked this pull request as draft May 1, 2025 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants