FIX: Move all tests using subprocess to the same worker on windows #29981

timhoffm · 2025-04-28T20:26:05Z

This is a somewhat wild guess based on #29797 (comment)

which makes me think we are somehow crossing state when launching the subprocesses.

Using the xdist_group with --dist=loadgroup should put all tests of that group to the same worker according to https://pytest-xdist.readthedocs.io/en/stable/distribution.html. I've only added --dist=loadgroup to the windows pipelines, so tests on other systems are not affected at all.
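For readers unfamiliar with the marker, here is a minimal sketch of how a test could be assigned to such a group. The group name "subprocess" and the test itself are illustrative assumptions, not copied from the PR's diff; the marker only has an effect when pytest-xdist is installed and the run uses `--dist=loadgroup`.

```python
import subprocess
import sys

import pytest


# All tests carrying the same xdist_group name are sent to one worker,
# but only when pytest is started with `-n <N> --dist=loadgroup`.
# The group name "subprocess" is an assumption made for this sketch.
@pytest.mark.xdist_group(name="subprocess")
def test_child_interpreter_runs():
    # Launch a fresh interpreter, capture its output, and fail on timeout.
    proc = subprocess.run(
        [sys.executable, "-c", "print('ok')"],
        capture_output=True, text=True, timeout=60, check=True,
    )
    assert proc.stdout.strip() == "ok"
```

A run such as `pytest -n 4 --dist=loadgroup` would then keep every test in that group on a single worker, while ungrouped tests are still distributed across all workers.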

The first step is to see whether this works in the PR. But to be sure, I think we would need to put this on master and monitor whether the timeouts disappear.

dstansby · 2025-04-28T21:13:37Z

Well it passes here 🎉 Comparing with latest main (that failed), the same number of tests were run, and the test time was about a minute slower here, which all checks out.

timhoffm · 2025-04-28T22:03:12Z

Trying to power-cycle, but apparently azure pipeline results are cached and not rerun 😞.

timhoffm · 2025-04-28T22:06:08Z

Ok second try (only reworded commit message to let azure believe this is new). Let's see if we can get this passing multiple times.

timhoffm · 2025-04-28T22:41:12Z

Second run also successful.

Yet another rewording. Three's the charm.

tacaswell · 2025-04-29T04:47:14Z

4 errors (two look like timeouts, one looks like a failure to read back a png and the last is some random test later when the warning about unclosed files got raised during GC and pytest failed the test it was in).

timhoffm · 2025-04-29T07:14:40Z

One of the failing tests was not marked with the xdist_group. -> fix and retry.


timhoffm · 2025-04-29T08:31:01Z

First test run after fix successful. Force-push to re-run a second time.

tacaswell · 2025-04-29T13:19:41Z

The annulus test that is failing is a bit pathological: there are two tests that share the same baseline image. I'm going to push a commit merging them.

QuLogic · 2025-04-29T22:16:38Z

There is a lock for re-used filenames, but it is only used if you re-use a filename in the same test, so it is best to merge these, even if it would just be two figures in one test (which I don't think you did anyway.)

QuLogic · 2025-04-29T22:17:35Z

lib/matplotlib/tests/test_patches.py

@@ -825,8 +825,9 @@ def test_annulus():
    ax.set_aspect('equal')


+@pytest.mark.parametrize('mode', ('a', 'b'))


It seems like a is (setting by) axis and b is radius.

Yeah, that is what it is doing. I went with the solution that was most obvious to me; making two figures with the same name is a better solution.

On third thought, I left it parametrized, but gave better names (and re-ordered the code a bit to only have one if/elif).

tacaswell · 2025-04-30T01:56:46Z

It is a bit aggressive to backport this PR; if anyone has the slightest concern, let's not.

tacaswell · 2025-04-30T02:25:15Z

If we are correct that the problem is trying to launch simultaneous subprocesses from the pytest-xdist workers, do we have any theory as to why this is a problem? This seems like something that should be safe.

My best guess (I did look at the source of subprocess.py to try to verify this guess, and it was clear to me that it would take more time than I can spend right now) is that, because of the way xdist creates the additional workers, the file handles that are generated for the pipes are shared, and hence we get the cross talk between the workers.
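For reference, the pattern under discussion (a child interpreter launched with piped output and a timeout) can be sketched as follows. This is only an illustration of the call shape; it does not reproduce the Windows hang, and the helper name `run_child` is hypothetical.

```python
import subprocess
import sys


def run_child(code, timeout):
    """Run `code` in a fresh interpreter with piped stdout/stderr.

    Returns the child's stdout, or None if it did not finish in time.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        return None
    return proc.stdout


fast = run_child("print('done')", timeout=30)
slow = run_child("import time; time.sleep(30)", timeout=1)
```

Here `fast` holds the child's output, while `slow` is None because the one-second timeout expires first. In the failing CI runs, it is this timeout path that was being hit unexpectedly.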

timhoffm · 2025-04-30T05:33:08Z

I'm also not clear on the mechanics in the background, but the issues linked in #29797 (comment) indicate that there can be interplay between the subprocesses and timeout handling, and that could be specific to windows.

It's a bit weird that it only(?) happens for Python 3.10/3.11. I'm inclined to limit the --dist=loadgroup parameter to only 3.11. First, because we then still test with unsynced subprocesses, and if 3.12/3.13 exhibit the error we are more sure it's a discriminating factor. Second, if it does not happen in 3.12/3.13 anymore, we can phase out the usage of xdist groups when dropping 3.11, and remove unnecessary overhead.
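The gating condition proposed above can be written out as a small sketch. In the PR it was applied in azure-pipelines.yml, so this Python form is only an illustration, and the function name `xdist_dist_args` is hypothetical.

```python
import sys


def xdist_dist_args(platform, version):
    """Return extra pytest arguments for a platform and a version tuple."""
    # Only Windows with Python 3.11 gets the grouping; every other
    # combination keeps the default xdist distribution mode.
    if platform == "win32" and tuple(version[:2]) == (3, 11):
        return ["--dist=loadgroup"]
    return []


# Arguments for the interpreter running this snippet.
current_args = xdist_dist_args(sys.platform, sys.version_info)
```

Keeping the condition this narrow means the other Windows jobs continue to exercise ungrouped subprocess tests, which is what makes the comparison between versions informative.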

timhoffm · 2025-04-30T13:53:16Z

The recent change showed the timeout again. Can somebody with more pipeline configuration skills please verify whether my attempt to apply --dist=loadgroup only to the py3.11 pipeline is valid?

QuLogic · 2025-04-30T22:54:01Z

I'm not sure that PYTHON_VERSION is actually set anywhere; it appears to be a leftover from when we did Pre-release testing. I think you need to use $(python.version) to fetch the runtime variable defined by the matrix.

timhoffm · 2025-05-01T07:04:57Z

I still have an echo statement in azure-pipelines.yml to verify the code works as intended. It does. I'm letting the CI complete to get another data point on "this does not time out". When that's complete, I'll remove the echo.

timhoffm · 2025-05-01T07:19:23Z

Everything seems to work. Echo removed.


timhoffm · 2025-05-01T08:04:49Z

Got a timeout, and found out there were more tests using subprocesses, which weren't yet in the xdist_group. Added them.

timhoffm · 2025-05-01T08:40:51Z

Let's put this on hold in favor of #29992.

timhoffm · 2025-05-05T10:03:20Z

Superseded by #29992.

timhoffm closed this Apr 28, 2025

timhoffm reopened this Apr 28, 2025

timhoffm force-pushed the fix-timeout branch from 4f63b88 to 40f20c3 Compare April 28, 2025 22:05

timhoffm force-pushed the fix-timeout branch from 40f20c3 to b4da8a0 Compare April 28, 2025 22:41

timhoffm force-pushed the fix-timeout branch from b4da8a0 to fac51fa Compare April 29, 2025 07:14

timhoffm force-pushed the fix-timeout branch from fac51fa to d3d72fc Compare April 29, 2025 08:23

TST: merge tests that share the same baseline image

75ba0d5

QuLogic reviewed Apr 29, 2025


TST: re-factor a bit and use better parameter names

dc401b3

tacaswell added this to the v3.10.2 milestone Apr 30, 2025

tacaswell approved these changes Apr 30, 2025


timhoffm force-pushed the fix-timeout branch from 0acc308 to 30b38a7 Compare April 30, 2025 12:34

timhoffm force-pushed the fix-timeout branch from 30b38a7 to 1769c1f Compare May 1, 2025 06:42

Limit usage of xdist grouping to Python 3.11

13d1e51

timhoffm force-pushed the fix-timeout branch from 1769c1f to 13d1e51 Compare May 1, 2025 07:18

Add some more tests using subprocesses to the subprocess xdist_group

0590a05

Note: Some of them have been marked flaky - may be worth reconstructing whether that was due to timeouts.

timhoffm marked this pull request as draft May 1, 2025 08:41

ksunden modified the milestones: v3.10.2, v3.10.3 May 2, 2025

timhoffm closed this May 5, 2025

QuLogic added the status: superseded label May 6, 2025

QuLogic removed this from the v3.10.3 milestone May 6, 2025

timhoffm mentioned this pull request May 9, 2025

Update pinned oldest win image on azure #29992

Merged

