Performance issue with ARM64 windows Python release binaries #134524
Comments
Should start by comparing results with @mdboom to validate the difference. Also please test some recent versions - you're a couple of years out of date, and the compilers in particular have seen a lot of improvement in that time. I suggest looking up past discussions about compiler choice, it's been more than adequately covered in the past (short story - use clang-cl if you want, and be prepared to recompile every extension module you use, because we can't guarantee they'll be compatible). Most of the performance improvements over the last few releases have been driven by GCC on Linux, and so the results don't always show up on Windows. If you want to dig into some of the patterns and find ways to also make them fast on Windows, I'm sure the perf-focused people will be interested to hear about them.
The latest ARM64 Python releases (3.13.0 and later) have even worse PyBench results than 3.12.10 and earlier releases. That's why I moved back to the older version. I can share the data. I would love to work on performance improvements for the ARM64 releases for Windows.
@diegorusso might be interested too.
Right, but I called out Mike because I know he has data, because we knew that performance on ARM64 was challenging and had already started looking into it. So let's continue from where that got to rather than starting fresh. I'll also mention that the MSVC compiler version is every bit as important as the Python version, so whoever is testing stuff, make sure you grab that number as well (it's in |
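As a rough illustration of recording the compiler alongside the Python version, here is a minimal sketch that uses only the standard library. The helper names `msc_version` and `build_report` are hypothetical, and the sample version string in the usage note is illustrative rather than copied from a real build. Note that the `MSC v.NNNN` field only identifies the compiler's major and minor version (for example, 1943 corresponds to 19.43); the full build number would still have to be taken from the build environment.

```python
import platform
import re
import sys


def msc_version(version_string):
    """Return the MSC version number found in a sys.version-style string, or None."""
    match = re.search(r"MSC v\.(\d+)", version_string)
    return int(match.group(1)) if match else None


def build_report():
    """Collect interpreter and compiler details worth attaching to a benchmark run."""
    return {
        "python": platform.python_version(),
        "compiler": platform.python_compiler(),
        "machine": platform.machine(),
        "msc_version": msc_version(sys.version),  # None on non-MSVC builds
    }


if __name__ == "__main__":
    for key, value in build_report().items():
        print(f"{key}: {value}")
```

For example, `msc_version("3.13.0 (illustrative) [MSC v.1941 64 bit (ARM64)]")` returns `1941`. Pasting the output of `build_report()` next to each set of benchmark numbers makes it easier to tell whether a difference comes from the Python version or from the compiler.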
Thanks @zooba. For all of my tests, I am using the following MSVC version: VS 2022, 19.43.34810 for ARM64. Windows on ARM is becoming a reality, and Microsoft is also serious about this. If we are able to achieve good performance scores on WOA devices, it will be a pretty big achievement.
Just to check, can anyone confirm that Profile Guided Optimization (PGO) works with MSVC ARM64 when compiling Python (3.13.0) for ARM64? I tried using Visual Studio 2022 Professional, but it gave the following error:
LINK : warning LNK4256: Profile Guided Optimization is not available in this edition of the product
Does anyone have any idea how I can use PGO to compile Python for ARM64? By the way, I am using the following command:
Build.bat --pgo -p ARM64
Here's the relevant lines from the 3.13.0 build log, it sure looks like it was using PGO (though still using the old arguments, which we should probably update to the new ones... not sure how big a difference that'll make, but it's worth testing): First line (which is very long) taken from this part of the build. The rest is from this part of the build.
BTW, if I use the VS development command prompt for the ARM64 build (native ARM64 compilation), I get the error mentioned in the comment. However, if I use PowerShell or the normal command prompt, it works fine.
I noticed that the recent 3.14.0b2 beta release uses the MSVC 14.43.34808 compiler. I tried the same repo with the latest 14.44.35207 compiler and observed better results in pyperformance and pybench with ARM64 binaries. Is there any chance we can move to the latest MSVC compiler for upcoming releases?
Below are the results of pyperformance with the release versus locally compiled with MSVC 14.44.35207 with PGO.
[Result tables not preserved. One table was given for each of the tags 'apps', 'math', 'regex', 'serialize', 'startup' and 'template', followed by an 'All benchmarks' table.]
Benchmarks hidden because not significant:
- 'math' (1): pidigits
- 'regex' (1): regex_v8
- 'serialize' (3): json_loads, pickle, pickle_list
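A comparison like this is often summarized as a per-benchmark speedup plus a geometric mean. The sketch below shows that arithmetic only; it is not how pyperformance itself produces its tables. The function names are hypothetical, and the benchmark names and timings in the usage note are made-up placeholders, not results from this issue.

```python
from statistics import geometric_mean


def speedups(baseline, candidate):
    """Map each benchmark present in both runs to baseline_time / candidate_time.

    A ratio above 1.0 means the candidate build was faster on that benchmark.
    """
    common = sorted(set(baseline) & set(candidate))
    return {name: baseline[name] / candidate[name] for name in common}


def summarize(baseline, candidate):
    """Return (per-benchmark ratios, geometric mean), or (ratios, None) if nothing overlaps."""
    ratios = speedups(baseline, candidate)
    if not ratios:
        return ratios, None
    return ratios, geometric_mean(list(ratios.values()))


if __name__ == "__main__":
    # Placeholder timings in seconds, purely to show the calculation.
    release_build = {"bench_a": 2.0, "bench_b": 8.0}
    local_build = {"bench_a": 1.0, "bench_b": 2.0}
    ratios, overall = summarize(release_build, local_build)
    print(ratios, overall)
```

With the placeholder numbers the ratios are 2.0 and 4.0, so the geometric mean is the square root of 8. The geometric mean is used instead of the arithmetic mean so that one very large ratio does not dominate the summary.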
We build on Azure Pipelines infrastructure, which typically matches the GitHub Actions images (with a couple days of eventual consistency as they deploy updates). We don't really want to maintain custom infrastructure for builds, though it does mean that updates arrive a bit slower. Though having said that, back when we did maintain custom infrastructure, the updates arrived even slower. So we've improved by moving to shared images. As a general rule, our binary distribution is useful and functional, but if you have special needs then you're more than welcome to build from source. You can get even more performance benefits for your scenario if you're willing to invest a little, so it really comes down to how much of that effort you want to try and persuade volunteers to do for you instead of doing it yourself.
Bug report
Bug description:
Hi Team,
I have an X-Elite laptop with an ARM64-based SoC, and I’ve been running Python workloads on it. However, I’ve noticed that Python seems to perform slower on Windows on ARM devices. To investigate, I used pybench, which provides a solid set of test cases for performance benchmarking. For comparison, I also tested an Intel x64 Lunar Lake 258V device, which has Geekbench performance similar to the X-Elite, to see the performance delta.
Pybench : https://share.gtd-gmbh.de/d/7e9368c6350a4894bf8f/files/?p=%2FWorklets%2FPyBench%2Fpybench-for-3.10.tar.gz&dl=1
I collected the following results:
To further analyze, I tested multiple Python versions and observed that earlier ARM64 Windows releases performed better than the latest one:
Python Version Comparison
It’s clear that x64 performance has improved with each release, while ARM64 performance has been inconsistent, with a noticeable regression in the latest version.
I also cloned the Python 3.12.10 source and compiled it on the ARM64 Windows device using different compilers. I found that using clang-cl (19.1.2) with computed gotos enabled yielded significantly better performance than the official release:
Compiled vs. Released (Python 3.12.10)
I have a question:
Can anybody please share the compilation steps (which compiler and flags) used to build the release ARM64 Windows binaries? If it is MSVC, is there any specific reason for not using clang-cl? Based on my experiments with pybench, I am seeing good results with clang-cl. Are there other test cases run against the release binaries where clang-cl does not perform better?
Analysis:
I collected ETL logs on Windows and profiled the test with Profile Explorer. The bottleneck I am seeing is in the interpreter itself: the function python312.dll!_PyEval_EvalFrameDefault.
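To check this kind of observation across builds without the full pybench suite, a small pure-Python workload can be timed under each interpreter. The following is a minimal sketch, not part of pybench; the workload and the function names are hypothetical. It assumes that a tight arithmetic loop spends most of its time dispatching bytecode in the interpreter loop, which makes it a reasonable (if crude) proxy.

```python
import platform
import timeit


def interpreter_bound_workload(n=200_000):
    """Pure-Python arithmetic loop; most of the time goes to bytecode dispatch."""
    total = 0
    for i in range(n):
        total += (i * 3) ^ (i >> 1)
    return total


def best_time(repeat=5, number=3):
    """Return the best wall-clock time in seconds for `number` calls of the workload."""
    return min(timeit.repeat(interpreter_bound_workload, repeat=repeat, number=number))


if __name__ == "__main__":
    print(platform.python_version(), platform.python_compiler(), platform.machine())
    print(f"best of 5: {best_time():.4f} s")
```

Running the same script with each build (official release, local MSVC build, local clang-cl build) on the same machine gives a quick first comparison. Taking the minimum over several repeats reduces noise from other processes, but it is still a single micro-benchmark and should not replace pyperformance results.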
CPython versions tested on:
CPython main branch, 3.12
Operating systems tested on:
Windows