Skip to content

[MNT]: Flaky Windows_py31x tests on Azure Pipelines #29797

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jayaddison opened this issue Mar 24, 2025 · 12 comments
Open

[MNT]: Flaky Windows_py31x tests on Azure Pipelines #29797

jayaddison opened this issue Mar 24, 2025 · 12 comments

Comments

@jayaddison
Copy link
Contributor

Summary

The Windows_py310 Azure Pipelines test jobs run by GitHub Actions for this repo appear to be unreliable at the moment.

Test timeouts and failures have occurred on both pull requests and mainline branch checks, and the failures do not appear related to the commits-under-test:

It's possible that this is a temporary/ephemeral issue.

Proposed fix

N/A (yet) - I'm not sure what the cause of these failures is yet

@jayaddison
Copy link
Contributor Author

This does also occur, albeit less frequently, on Windows_py312; e.g. here: https://dev.azure.com/matplotlib/matplotlib/_build/results?buildId=43004&view=logs&jobId=f92c32cd-721a-536e-3a87-a0c010db3f7f&j=f92c32cd-721a-536e-3a87-a0c010db3f7f&t=d4eecd58-7131-52e4-f793-e9e3ca32dedb

A commonality between the failure cases seems to be that one or more subprocess-related tests fails, usually due to a timeout.

@jayaddison jayaddison changed the title [MNT]: Flaky Windows_py310 test on Azure Pipelines [MNT]: Flaky Windows_py31x tests on Azure Pipelines Mar 24, 2025
@timhoffm
Copy link
Member

Note: This flaky behavior exists for some time now. Hard to reconstruct, but likely at least since the beginning of the year. While the CI run details have been cleaned up, I suspect that e.g. the following PRs were affected #29494 (Jan 20, 2025), #29358 (Dec 30, 2024), #29262 (Dec 9, 2024).

@jayaddison
Copy link
Contributor Author

Question: do all of the test cases that exhibit this use subprocess.PIPE for output/stderr? Initially I think: yes; but I haven't confirmed that exhaustively.

Superficially the failures appear similar to this bugreport (note: the PR mentioned in the description is where the bug was encountered -- not a subsequent PR to address it): python/cpython#85481

@dstansby
Copy link
Member

Sionce these are all on Python 3.10 (I think?), and according to NEP29 we can drop support for Python 3.10 on 4th April (https://numpy.org/neps/nep-0029-deprecation_policy.html#support-table), it's probably not worth spending much time looking into this?

@timhoffm
Copy link
Member

Not quite, it also happens on newer python versions, e.g. #29797 (comment). But apparently less frequently.

@jayaddison
Copy link
Contributor Author

I think it's unlikely to be due to a bug in matplotlib itself, so to that extent I agree it's not worth spending much time on - I reckon it's more likely to be an artifact of the subprocess calls from the test suite.

Even so: if it's a sample case for a problem that could be figured out and resolved elsewhere, then that could be helpful (and eventually provide reduced test flakiness, meaning fewer distractions/delays, here as a benefit).

@timhoffm
Copy link
Member

I have no statistics, but the issue still happens on 3.11, and I have the feeling that it's more common now on 3.11 since we have removed 3.10. 🤔

@tacaswell
Copy link
Member

2025-04-26T03:41:49.5018861Z ================================== FAILURES ===================================
2025-04-26T03:41:49.5019558Z _________________________ test_axeswidget_interactive _________________________
2025-04-26T03:41:49.5020124Z [gw0] win32 -- Python 3.11.9 C:\hostedtoolcache\windows\Python\3.11.9\x64\python.exe
2025-04-26T03:41:49.5020520Z 
2025-04-26T03:41:49.5021989Z     @pytest.mark.xfail(  # https://github.com/actions/setup-python/issues/649
2025-04-26T03:41:49.5023115Z             ('TF_BUILD' in os.environ or 'GITHUB_ACTION' in os.environ) and
2025-04-26T03:41:49.5023718Z             sys.platform == 'darwin' and sys.version_info[:2] < (3, 11),
2025-04-26T03:41:49.5024241Z             reason='Tk version mismatch on Azure macOS CI'
2025-04-26T03:41:49.5024702Z         )
2025-04-26T03:41:49.5025169Z     def test_axeswidget_interactive():
2025-04-26T03:41:49.5025642Z >       subprocess_run_helper(
2025-04-26T03:41:49.5026110Z             _test_axeswidget_interactive,
2025-04-26T03:41:49.5026598Z             timeout=120 if is_ci_environment() else 20,
2025-04-26T03:41:49.5027077Z             extra_env={'MPLBACKEND': 'tkagg'}
2025-04-26T03:41:49.5027573Z         )
2025-04-26T03:41:49.5027859Z 
2025-04-26T03:41:49.5028365Z lib\matplotlib\tests\test_pickle.py:335: 
2025-04-26T03:41:49.5028863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-04-26T03:41:49.5029386Z lib\matplotlib\testing\__init__.py:128: in subprocess_run_helper
2025-04-26T03:41:49.5029889Z     proc = subprocess_run_for_testing(
2025-04-26T03:41:49.5030397Z lib\matplotlib\testing\__init__.py:96: in subprocess_run_for_testing
2025-04-26T03:41:49.5030892Z     proc = subprocess.run(
2025-04-26T03:41:49.5031400Z C:\hostedtoolcache\windows\Python\3.11.9\x64\Lib\subprocess.py:550: in run
2025-04-26T03:41:49.5032004Z     stdout, stderr = process.communicate(input, timeout=timeout)
2025-04-26T03:41:49.5032544Z C:\hostedtoolcache\windows\Python\3.11.9\x64\Lib\subprocess.py:1209: in communicate
2025-04-26T03:41:49.5033075Z     stdout, stderr = self._communicate(input, endtime, timeout)
2025-04-26T03:41:49.5033570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-04-26T03:41:49.5033881Z 
2025-04-26T03:41:49.5034362Z self = <Popen: returncode: 1 args: ['C:\\hostedtoolcache\\windows\\Python\\3.11.9\\...>
2025-04-26T03:41:49.5034874Z input = None, endtime = 1749.421, orig_timeout = 20
2025-04-26T03:41:49.5035239Z 
2025-04-26T03:41:49.5035742Z     def _communicate(self, input, endtime, orig_timeout):
2025-04-26T03:41:49.5036241Z         # Start reader threads feeding into a list hanging off of this
2025-04-26T03:41:49.5036745Z         # object, unless they've already been started.
2025-04-26T03:41:49.5037240Z         if self.stdout and not hasattr(self, "_stdout_buff"):
2025-04-26T03:41:49.5037716Z             self._stdout_buff = []
2025-04-26T03:41:49.5038167Z             self.stdout_thread = \
2025-04-26T03:41:49.5038635Z                     threading.Thread(target=self._readerthread,
2025-04-26T03:41:49.5039115Z                                      args=(self.stdout, self._stdout_buff))
2025-04-26T03:41:49.5039591Z             self.stdout_thread.daemon = True
2025-04-26T03:41:49.5040056Z             self.stdout_thread.start()
2025-04-26T03:41:49.5040525Z         if self.stderr and not hasattr(self, "_stderr_buff"):
2025-04-26T03:41:49.5041747Z             self._stderr_buff = []
2025-04-26T03:41:49.5042620Z             self.stderr_thread = \
2025-04-26T03:41:49.5043190Z                     threading.Thread(target=self._readerthread,
2025-04-26T03:41:49.5043741Z                                      args=(self.stderr, self._stderr_buff))
2025-04-26T03:41:49.5044303Z             self.stderr_thread.daemon = True
2025-04-26T03:41:49.5044820Z             self.stderr_thread.start()
2025-04-26T03:41:49.5045419Z     
2025-04-26T03:41:49.5046478Z         if self.stdin:
2025-04-26T03:41:49.5050733Z             self._stdin_write(input)
2025-04-26T03:41:49.5053285Z     
2025-04-26T03:41:49.5053837Z         # Wait for the reader threads, or time out.  If we time out, the
2025-04-26T03:41:49.5054352Z         # threads remain reading and the fds left open in case the user
2025-04-26T03:41:49.5055064Z         # calls communicate again.
2025-04-26T03:41:49.5057020Z         if self.stdout is not None:
2025-04-26T03:41:49.5059582Z             self.stdout_thread.join(self._remaining_time(endtime))
2025-04-26T03:41:49.5060257Z             if self.stdout_thread.is_alive():
2025-04-26T03:41:49.5060743Z >               raise TimeoutExpired(self.args, orig_timeout)
2025-04-26T03:41:49.5066717Z E               subprocess.TimeoutExpired: Command '['C:\\hostedtoolcache\\windows\\Python\\3.11.9\\x64\\python.exe', '-c', "import importlib.util;_spec = importlib.util.spec_from_file_location('matplotlib.tests.test_pickle', 'D:\\\\a\\\\1\\\\s\\\\lib\\\\matplotlib\\\\tests\\\\test_pickle.py');_module = importlib.util.module_from_spec(_spec);_spec.loader.exec_module(_module);_module._test_axeswidget_interactive()"]' timed out after 20 seconds
2025-04-26T03:41:49.5073916Z 
2025-04-26T03:41:49.5074559Z C:\hostedtoolcache\windows\Python\3.11.9\x64\Lib\subprocess.py:1630: TimeoutExpired

https://dev.azure.com/matplotlib/matplotlib/_build/results?buildId=43612&view=logs&j=c0a65fc6-a4f7-5116-4cee-beb3104aaa7c&t=ccc2daa7-0687-5084-0ea1-0e98a3640679&l=196

The odd thing to me is that the test that fails is test_axeswidget_interactive, but the communication failure talks about

"import importlib.util;_spec = importlib.util.spec_from_file_location('matplotlib.tests.test_pickle', 'D:\\\\a\\\\1\\\\s\\\\lib\\\\matplotlib\\\\tests\\\\test_pickle.py');

which makes me think we are somehow crossing state what launching the subprocesses.

timhoffm added a commit to timhoffm/matplotlib that referenced this issue Apr 28, 2025
This is a somewhat wild guess based on https://github
.com/matplotlib/issues/29797#issuecomment-2833795708

> which makes me think we are somehow crossing state what launching the subprocesses.

Using the `xdist_group` with `--dist=loadgroup`
should put all tests of that group to the same
worker according to https://pytest-xdist
.readthedocs
.io/en/stable/distribution.html. I've only added `--dist=loadgroup` to
the windows pipelines, so tests on other systems are not affected at all
.

The first test is to see whether this works in the PR. - But to be sure,
it think we would need to put this on master and monitor whether the
timeouts disappear.
timhoffm added a commit to timhoffm/matplotlib that referenced this issue Apr 28, 2025
This is a somewhat wild guess based on matplotlib#29797 (comment)

> which makes me think we are somehow crossing state what launching the subprocesses.

Using the `xdist_group` with `--dist=loadgroup` should put all tests of
that group to the same worker according to
https://pytest-xdist.readthedocs.io/en/stable/distribution.html.
I've only added `--dist=loadgroup` to the windows pipelines, so tests on
 other systems are not affected at all.

The first test is to see whether this works in the PR. - But to be sure,
it think we would need to put this on master and monitor whether the
timeouts disappear.
timhoffm added a commit to timhoffm/matplotlib that referenced this issue Apr 28, 2025
This is a somewhat wild guess based on
matplotlib#29797 (comment)

> which makes me think we are somehow crossing state what launching the subprocesses.

Using the `xdist_group` with `--dist=loadgroup` should put all tests of
that group to the same worker according to
https://pytest-xdist.readthedocs.io/en/stable/distribution.html.
I've only added `--dist=loadgroup` to the windows pipelines, so tests on
 other systems are not affected at all.

The first test is to see whether this works in the PR. - But to be sure,
it think we would need to put this on master and monitor whether the
timeouts disappear.
timhoffm added a commit to timhoffm/matplotlib that referenced this issue Apr 29, 2025
This is a somewhat wild guess based on
matplotlib#29797 (comment)

> which makes me think we are somehow crossing state what launching the subprocesses.

Using the `xdist_group` with `--dist=loadgroup` should put all tests of
that group to the same worker according to
https://pytest-xdist.readthedocs.io/en/stable/distribution.html.
I've only added `--dist=loadgroup` to the windows pipelines, so tests on
 other systems are not affected at all.

The first test is to see whether this works in the PR. - But to be sure,
it think we would need to put this on master and monitor whether the
timeouts disappear.
timhoffm added a commit to timhoffm/matplotlib that referenced this issue Apr 29, 2025
This is a somewhat wild guess based on
matplotlib#29797 (comment)

> which makes me think we are somehow crossing state what launching the subprocesses.

Using the `xdist_group` with `--dist=loadgroup` should put all tests of
that group to the same worker according to
https://pytest-xdist.readthedocs.io/en/stable/distribution.html.
I've only added `--dist=loadgroup` to the windows pipelines, so tests on
other systems are not affected at all.

The first test is to see whether this works in the PR. - But to be sure,
it think we would need to put this on master and monitor whether the
timeouts disappear.
@timhoffm
Copy link
Member

Confirmation from #29986

No timeout will occour if all tests are run sequentially.

@timhoffm
Copy link
Member

This may be a bad interaction between pytest-xdist and pytest-timeout. Related topics (I have not yet dug into them in detail):

@rcomer
Copy link
Member

rcomer commented Apr 30, 2025

I have no statistics, but the issue still happens on 3.11, and I have the feeling that it's more common now on 3.11 since we have removed 3.10. 🤔

When we dropped py310 we switched py311 to use the windows-2019 image. Could the image be the problem? I went searching for info about it and (if I’m looking at the right thing?) it is soon to be deprecated
actions/runner-images#12045
Link found via the "included software" column in the table here: https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/hosted?view=azure-devops&tabs=yaml

@timhoffm
Copy link
Member

timhoffm commented May 1, 2025

Good observation. Certainly worth a try to see whether switching to windows-2022 or windows-latest fixes the issue.

timhoffm added a commit to timhoffm/matplotlib that referenced this issue May 1, 2025
The windows-2019 image will be deprecated on 1.6.2025.
Also, this is a test whether the timeouts matplotlib#29797 are related to the
windows-2019 image.
timhoffm added a commit to timhoffm/matplotlib that referenced this issue May 1, 2025
The windows-2019 image will be deprecated on 1.6.2025.
Also, this is a test whether the timeouts matplotlib#29797 are related to the
windows-2019 image.
timhoffm added a commit to timhoffm/matplotlib that referenced this issue May 1, 2025
The windows-2019 image will be deprecated on 1.6.2025.
Also, this is a test whether the timeouts matplotlib#29797 are related to the
windows-2019 image.
timhoffm added a commit to timhoffm/matplotlib that referenced this issue May 2, 2025
The windows-2019 image will be deprecated on 1.6.2025.
Also, this is a test whether the timeouts matplotlib#29797 are related to the
windows-2019 image.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants