Performance issue with ARM64 windows Python release binaries #134524

Open
Akash9824 opened this issue May 22, 2025 · 9 comments
Labels
OS-windows, pending (the issue will be closed if no feedback is provided), performance (performance or resource usage), triaged (the issue has been accepted as valid by a triager)

@Akash9824

Akash9824 commented May 22, 2025

Bug report

Bug description:

Hi Team,

I have an X-Elite laptop with an ARM64-based SoC, and I’ve been running Python workloads on it. However, I’ve noticed that Python seems to run slower on Windows on ARM devices. To investigate, I used pybench, which provides a solid set of test cases for performance benchmarking. To see the performance delta, I also tested an Intel x64 Lunar Lake 258V device, which has Geekbench performance similar to the X-Elite.

Pybench : https://share.gtd-gmbh.de/d/7e9368c6350a4894bf8f/files/?p=%2FWorklets%2FPyBench%2Fpybench-for-3.10.tar.gz&dl=1
I collected the following results:

| Environment | Total Time (ms) |
| --- | --- |
| Windows on ARM64 | 802 |
| Windows on x64 | 507 |
| WSL2 (Linux on Windows ARM64) | 515 |
To further analyze, I tested multiple Python versions and observed that earlier ARM64 Windows releases performed better than the latest one:
Python Version Comparison

| Version | Windows ARM64 (ms) | Windows x64 (ms) |
| --- | --- | --- |
| 3.11.0 | 763 | 575 |
| 3.11.3 | 589 | Not tested |
| 3.11.6 | 590 | Not tested |
| 3.11.9 | 568 | Not tested |
| 3.12.0 | 666 | 545 |
| 3.12.5 | 688 | Not tested |
| 3.12.6 | 700 | Not tested |
| 3.12.7 | 802 | Not tested |
| 3.12.10 | 802 | 507 |

It’s clear that x64 performance improved across the three releases tested, while ARM64 performance has been inconsistent, with a noticeable regression in the latest versions (3.12.7 and 3.12.10).
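To put numbers on that comparison, here is a minimal sketch that derives ratios from the pybench totals reported above. The variable and function names are illustrative only; the timings are copied from the tables and nothing is re-measured.

```python
# Total pybench times in ms, copied from the tables above.
arm64_ms = {
    "3.11.0": 763, "3.11.3": 589, "3.11.6": 590, "3.11.9": 568,
    "3.12.0": 666, "3.12.5": 688, "3.12.6": 700, "3.12.7": 802,
    "3.12.10": 802,
}
x64_ms = {"3.11.0": 575, "3.12.0": 545, "3.12.10": 507}


def slowdown(slow_ms, fast_ms):
    """Return how many times longer slow_ms is than fast_ms."""
    return slow_ms / fast_ms


# ARM64 vs x64 for each version that was measured on both platforms.
arm64_vs_x64 = {v: slowdown(arm64_ms[v], x64_ms[v]) for v in x64_ms}

# Newest ARM64 release compared against the fastest ARM64 release measured.
best_version = min(arm64_ms, key=arm64_ms.get)
regression = slowdown(arm64_ms["3.12.10"], arm64_ms[best_version])

for version, ratio in arm64_vs_x64.items():
    print(f"{version}: ARM64 takes {ratio:.2f}x the x64 time")
print(f"3.12.10 vs {best_version} on ARM64: {regression:.2f}x")
```

On these figures the ARM64/x64 gap widens from roughly 1.33x (3.11.0) to roughly 1.58x (3.12.10), and 3.12.10 takes about 1.41x as long as 3.11.9 on ARM64.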
I also cloned the Python 3.12.10 source and compiled it on the ARM64 Windows device using different compilers. I found that using clang-cl (19.1.2) with computed gotos enabled yielded significantly better performance than the official release:
Compiled vs. Released (Python 3.12.10)

| Version | ARM64 (Release) | ARM64 (Compiled) |
| --- | --- | --- |
| Python v3.12.10 (ms) | 802 | 628 |

I have a question:
Can anybody please share the compilation steps (which compiler and flags) used to build the release ARM64 Windows binaries? If it is MSVC, is there any specific reason for not using clang-cl? Based on my experiment with pybench, I am seeing good results with clang-cl. Are there other test cases run against the release binaries where clang-cl does not perform better?

Analysis:
I collected ETL logs on Windows and profiled the test with Profile Explorer. The bottleneck I am seeing is in the interpreter's main evaluation loop: the function python312.dll!_PyEval_EvalFrameDefault.
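For anyone who wants a quick cross-check without pybench, the following is a minimal sketch of an interpreter-bound micro-benchmark. It is not part of pybench; the function name and loop body are made up for illustration. A pure-Python loop like this spends its time in bytecode evaluation rather than in I/O or extension modules, so differences between builds should mostly reflect the eval loop.

```python
import timeit


def interpreter_bound_loop(n=200_000):
    """Pure-Python integer loop with no I/O and no extension-module calls."""
    total = 0
    for i in range(n):
        total += i * 2 - (i >> 1)
    return total


# Take the best of several repeats to reduce scheduler noise.
best_seconds = min(timeit.repeat(interpreter_bound_loop, number=5, repeat=5))
print(f"best of 5 repeats: {best_seconds * 1000:.1f} ms for 5 calls")
```

Running the same script under each build (release, clang-cl, different MSVC versions) gives a rough relative number; it does not replace a profile.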

CPython versions tested on:

CPython main branch, 3.12

Operating systems tested on:

Windows

Akash9824 added the type-bug (An unexpected behavior, bug, or error) label May 22, 2025
Akash9824 changed the title from "Performance issue with ARM64 Python release binaries" to "Performance issue with ARM64 windows Python release binaries" May 22, 2025
zooba added the performance (Performance or resource usage) and OS-windows labels and removed the type-bug label May 22, 2025
@zooba
Member

zooba commented May 22, 2025

Should start by comparing results with @mdboom to validate the difference. Also please test some recent versions - you're a couple of years out of date, and the compilers in particular have seen a lot of improvement in that time.

I suggest looking up past discussions about compiler choice, it's been more than adequately covered in the past (short story - use clang-cl if you want, and be prepared to recompile every extension module you use, because we can't guarantee they'll be compatible).

Most of the performance improvements over the last few releases have been driven by GCC on Linux, and so the results don't always show up on Windows. If you want to dig into some of the patterns and find ways to also make them fast on Windows, I'm sure the perf-focused people will be interested to hear about them.

@Akash9824
Author

The latest ARM64 Python releases (3.13.0 and later) have even worse PyBench results than 3.12.10 and earlier releases. That's why I moved back to the older version. I can share the data. I would love to work on performance improvements for the ARM64 Windows releases.

@brandtbucher
Member

@diegorusso might be interested too.

@zooba
Member

zooba commented May 22, 2025

Right, but I called out Mike because I know he has data, because we knew that performance on ARM64 was challenging and had already started looking into it. So let's continue from where that got to rather than starting fresh.

I'll also mention that the MSVC compiler version is every bit as important as the Python version, so whoever is testing stuff, make sure you grab that number as well (it's in sys.version). Because we build releases on public CI machines, it can vary over time, and we don't tightly control it. Though in general the behaviour of MSVC over time is more stable (with the exception of ARM64 right now, because the MSVC team is actively working on improving it in each update).
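As a minimal sketch of recording the compiler alongside the Python version: `sys.version` and `platform.python_compiler()` are standard library, while the `msc_version` helper and the sample string below are hypothetical and only illustrate the shape of the compiler tag on MSVC builds.

```python
import platform
import re
import sys


def msc_version(version_text):
    """Return the number after 'MSC v.' in a sys.version-style string, or None."""
    match = re.search(r"MSC v\.(\d+)", version_text)
    return int(match.group(1)) if match else None


print(sys.version)                 # full string, including the compiler tag
print(platform.python_compiler())  # only the compiler part
print(msc_version(sys.version))    # None on builds not made with MSVC

# Hypothetical example of the tag's shape; not copied from a real build.
sample = "[MSC v.1941 64 bit (ARM64)]"
print(msc_version(sample))
```

Storing this value next to every benchmark result makes it possible to tell a Python regression apart from a compiler change.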

@Akash9824
Author

Akash9824 commented May 23, 2025

Thanks @zooba.

For all of these runs, I am using the following compiler versions:

MSVC version: VS 2022, 19.43.34810 for ARM64
Clang-CL version: 19.1.2

Windows on ARM is becoming a reality, and Microsoft is also serious about this. If we are able to achieve good performance scores on WOA devices, it will be a pretty big achievement.

picnixz added the type-bug (An unexpected behavior, bug, or error), triaged (The issue has been accepted as valid by a triager), and pending (The issue will be closed if no feedback is provided) labels and removed the type-bug label May 23, 2025
@Akash9824
Author

Just to check, can anyone confirm that Profile Guided Optimization (PGO) is working with MSVC ARM64 to compile Python (3.13.0) for ARM64? I tried using Visual Studio 2022 Professional, but it gave the following error:

LINK : warning LNK4256: Profile Guided Optimization is not available in this edition of the product

Does anyone have any idea how I can use PGO to compile Python for ARM64?

By the way, I am using the following command: Build.bat --pgo -p ARM64
Compiler version: Compiler Version 19.44.35207.1 for ARM64
MSBuild version: MSBuild version 17.14.8+a7a4d5af0

@zooba
Member

zooba commented May 30, 2025

Here's the relevant lines from the 3.13.0 build log, it sure looks like it was using PGO (though still using the old arguments, which we should probably update to the new ones... not sure how big a difference that'll make, but it's worth testing):

First line (which is very long) taken from this part of the build. The rest is from this part of the build.

   C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.41.34120\bin\HostX86\arm64\link.exe /ERRORREPORT:QUEUE /OUT:"D:\a\1\b\bin\arm64\instrumented\python313.dll" /INCREMENTAL:NO /NOLOGO /LIBPATH:D:\a\1\b\bin\arm64\instrumented\ version.lib ws2_32.lib pathcch.lib bcrypt.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /NODEFAULTLIB:LIBC /MANIFEST:NO /DEBUG /PDB:"D:\a\1\b\bin\arm64\instrumented\python313.pdb" /SUBSYSTEM:WINDOWS /PGD:"D:\a\1\b\bin\arm64\instrumented\python313.pgd" /LTCG:PGInstrument /LTCGOUT:"D:\a\1\s\PCbuild\obj\313arm64_PGInstrument\pythoncore\python313.iobj" /TLBID:1 /DYNAMICBASE /NXCOMPAT /IMPLIB:"D:\a\1\b\bin\arm64\instrumented\python313.lib" /MACHINE:ARM64 /OPT:REF,NOICF /DLL D:\a\1\s\PCbuild\obj\313arm64_PGInstrument\pythoncore\python_nt.res
  

         Merging D:\a\1\b\bin\arm64\python313!1.pgc
         D:\a\1\b\bin\arm64\python313!1.pgc: Used 26.6% (16469240 / 61988864) of total space reserved.  0.0% of the counts were dropped due to overflow.
           Reading PGD file 1: D:\a\1\b\bin\arm64\python313.pgd
            Creating library D:\a\1\b\bin\arm64\python313.lib and object D:\a\1\b\bin\arm64\python313.exp
         Generating code
         
         0 of 0 ( 0.0%) original invalid call sites were matched.
         0 new call sites were added.
         278 of 15678 (  1.77%) profiled functions will be compiled for speed, and the rest of the functions will be compiled for size
         31823 of 80106 inline instances were from dead/cold paths 
         15674 of 15678 functions (100.0%) were optimized using profile data, and the rest of the functions were optimized without using profile data
         229589446045 of 229589446045 instructions (100.0%) were optimized using profile data, and the rest of the instructions were optimized without using profile data
         Finished generating code
         pythoncore.vcxproj -> D:\a\1\b\bin\arm64\python313.dll
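The summary lines in that log can be checked mechanically. This is a minimal sketch, assuming the "N of M" wording stays as shown; the `fraction` helper is illustrative and the two input lines are copied from the log above.

```python
import re

# Two summary lines copied from the build log above.
log_lines = [
    "278 of 15678 (  1.77%) profiled functions will be compiled for speed, "
    "and the rest of the functions will be compiled for size",
    "15674 of 15678 functions (100.0%) were optimized using profile data, "
    "and the rest of the functions were optimized without using profile data",
]

pattern = re.compile(r"(\d+) of (\d+)")


def fraction(line):
    """Return (count, total, percent) from an 'N of M' summary line."""
    count, total = map(int, pattern.search(line).groups())
    return count, total, 100.0 * count / total


speed = fraction(log_lines[0])
profiled = fraction(log_lines[1])
print(f"compiled for speed: {speed[0]}/{speed[1]} = {speed[2]:.2f}%")
print(f"optimized with profile data: {profiled[0]}/{profiled[1]} = {profiled[2]:.2f}%")
```

A non-zero "optimized using profile data" count is the quickest sign that the PGO pass actually consumed the collected profile.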

@Akash9824
Author

Akash9824 commented Jun 6, 2025

BTW, if I use the VS developer command prompt for the ARM64 build (native ARM64 compilation), I get the error mentioned in my previous comment. However, if I use PowerShell or the normal command prompt, it works fine.

I noticed that for the recent 3.14.0b2 beta release, we are using the MSVC 14.43.34808 compiler. I tried the same repo with the latest 14.44.35207 compiler and observed better results in pyperformance and pybench with ARM64 binaries. Is there any chance we can move to the latest MSVC compiler for upcoming releases? Below are the results of pyperformance with the release versus locally compiled with MSVC 14.44.35207 with PGO.

Benchmarks with tag 'apps':

+----------------+---------------+------------------------+
| Benchmark | py314b2_local | py314b2_release |
+================+===============+========================+
| 2to3 | 209 ms | 228 ms: 1.09x slower |
+----------------+---------------+------------------------+
| docutils | 1.51 sec | 1.60 sec: 1.06x slower |
+----------------+---------------+------------------------+
| html5lib | 33.5 ms | 38.3 ms: 1.14x slower |
+----------------+---------------+------------------------+
| Geometric mean | (ref) | 1.10x slower |
+----------------+---------------+------------------------+

Benchmarks with tag 'math':

+----------------+---------------+-----------------------+
| Benchmark | py314b2_local | py314b2_release |
+================+===============+=======================+
| float | 57.6 ms | 66.8 ms: 1.16x slower |
+----------------+---------------+-----------------------+
| nbody | 82.0 ms | 96.6 ms: 1.18x slower |
+----------------+---------------+-----------------------+
| Geometric mean | (ref) | 1.11x slower |
+----------------+---------------+-----------------------+

Benchmark hidden because not significant (1): pidigits

Benchmarks with tag 'regex':

+----------------+---------------+-----------------------+
| Benchmark | py314b2_local | py314b2_release |
+================+===============+=======================+
| regex_compile | 78.9 ms | 89.6 ms: 1.14x slower |
+----------------+---------------+-----------------------+
| regex_dna | 131 ms | 122 ms: 1.07x faster |
+----------------+---------------+-----------------------+
| regex_effbot | 2.26 ms | 2.19 ms: 1.03x faster |
+----------------+---------------+-----------------------+
| Geometric mean | (ref) | 1.01x slower |
+----------------+---------------+-----------------------+

Benchmark hidden because not significant (1): regex_v8

Benchmarks with tag 'serialize':

+----------------------+---------------+------------------------+
| Benchmark | py314b2_local | py314b2_release |
+======================+===============+========================+
| json_dumps | 6.16 ms | 6.52 ms: 1.06x slower |
+----------------------+---------------+------------------------+
| pickle_dict | 21.4 us | 21.0 us: 1.02x faster |
+----------------------+---------------+------------------------+
| pickle_pure_python | 221 us | 250 us: 1.13x slower |
+----------------------+---------------+------------------------+
| tomli_loads | 1.65 sec | 1.74 sec: 1.05x slower |
+----------------------+---------------+------------------------+
| unpickle | 8.47 us | 8.83 us: 1.04x slower |
+----------------------+---------------+------------------------+
| unpickle_list | 3.34 us | 3.28 us: 1.02x faster |
+----------------------+---------------+------------------------+
| unpickle_pure_python | 157 us | 192 us: 1.22x slower |
+----------------------+---------------+------------------------+
| xml_etree_parse | 95.2 ms | 102 ms: 1.07x slower |
+----------------------+---------------+------------------------+
| xml_etree_iterparse | 66.1 ms | 71.5 ms: 1.08x slower |
+----------------------+---------------+------------------------+
| xml_etree_generate | 57.0 ms | 63.5 ms: 1.11x slower |
+----------------------+---------------+------------------------+
| xml_etree_process | 40.8 ms | 46.6 ms: 1.14x slower |
+----------------------+---------------+------------------------+
| Geometric mean | (ref) | 1.06x slower |
+----------------------+---------------+------------------------+

Benchmark hidden because not significant (3): json_loads, pickle, pickle_list

Benchmarks with tag 'startup':

+------------------------+---------------+-----------------------+
| Benchmark | py314b2_local | py314b2_release |
+========================+===============+=======================+
| python_startup | 22.7 ms | 24.2 ms: 1.07x slower |
+------------------------+---------------+-----------------------+
| python_startup_no_site | 19.1 ms | 20.2 ms: 1.06x slower |
+------------------------+---------------+-----------------------+
| Geometric mean | (ref) | 1.06x slower |
+------------------------+---------------+-----------------------+

Benchmarks with tag 'template':

+----------------+---------------+-----------------------+
| Benchmark | py314b2_local | py314b2_release |
+================+===============+=======================+
| genshi_text | 15.7 ms | 18.2 ms: 1.16x slower |
+----------------+---------------+-----------------------+
| genshi_xml | 34.4 ms | 37.9 ms: 1.10x slower |
+----------------+---------------+-----------------------+
| mako | 8.23 ms | 9.23 ms: 1.12x slower |
+----------------+---------------+-----------------------+
| Geometric mean | (ref) | 1.13x slower |
+----------------+---------------+-----------------------+

All benchmarks:

+--------------------------+---------------+------------------------+
| Benchmark | py314b2_local | py314b2_release |
+==========================+===============+========================+
| 2to3 | 209 ms | 228 ms: 1.09x slower |
+--------------------------+---------------+------------------------+
| async_generators | 252 ms | 275 ms: 1.09x slower |
+--------------------------+---------------+------------------------+
| asyncio_tcp | 359 ms | 398 ms: 1.11x slower |
+--------------------------+---------------+------------------------+
| chaos | 42.6 ms | 47.0 ms: 1.10x slower |
+--------------------------+---------------+------------------------+
| comprehensions | 12.3 us | 14.0 us: 1.14x slower |
+--------------------------+---------------+------------------------+
| bench_mp_pool | 69.6 ms | 81.8 ms: 1.17x slower |
+--------------------------+---------------+------------------------+
| bench_thread_pool | 1.03 ms | 1.09 ms: 1.06x slower |
+--------------------------+---------------+------------------------+
| coroutines | 18.2 ms | 22.3 ms: 1.23x slower |
+--------------------------+---------------+------------------------+
| coverage | 145 ms | 208 ms: 1.43x slower |
+--------------------------+---------------+------------------------+
| crypto_pyaes | 51.6 ms | 58.0 ms: 1.13x slower |
+--------------------------+---------------+------------------------+
| deepcopy | 170 us | 185 us: 1.09x slower |
+--------------------------+---------------+------------------------+
| deepcopy_reduce | 1.71 us | 1.91 us: 1.12x slower |
+--------------------------+---------------+------------------------+
| deepcopy_memo | 22.4 us | 25.2 us: 1.13x slower |
+--------------------------+---------------+------------------------+
| deltablue | 2.66 ms | 3.21 ms: 1.21x slower |
+--------------------------+---------------+------------------------+
| docutils | 1.51 sec | 1.60 sec: 1.06x slower |
+--------------------------+---------------+------------------------+
| fannkuch | 286 ms | 321 ms: 1.12x slower |
+--------------------------+---------------+------------------------+
| float | 57.6 ms | 66.8 ms: 1.16x slower |
+--------------------------+---------------+------------------------+
| create_gc_cycles | 888 us | 909 us: 1.02x slower |
+--------------------------+---------------+------------------------+
| generators | 26.4 ms | 28.7 ms: 1.09x slower |
+--------------------------+---------------+------------------------+
| genshi_text | 15.7 ms | 18.2 ms: 1.16x slower |
+--------------------------+---------------+------------------------+
| genshi_xml | 34.4 ms | 37.9 ms: 1.10x slower |
+--------------------------+---------------+------------------------+
| go | 91.5 ms | 110 ms: 1.20x slower |
+--------------------------+---------------+------------------------+
| hexiom | 4.83 ms | 5.69 ms: 1.18x slower |
+--------------------------+---------------+------------------------+
| html5lib | 33.5 ms | 38.3 ms: 1.14x slower |
+--------------------------+---------------+------------------------+
| json_dumps | 6.16 ms | 6.52 ms: 1.06x slower |
+--------------------------+---------------+------------------------+
| logging_format | 7.17 us | 8.56 us: 1.19x slower |
+--------------------------+---------------+------------------------+
| logging_silent | 285 ns | 341 ns: 1.20x slower |
+--------------------------+---------------+------------------------+
| logging_simple | 6.57 us | 7.73 us: 1.18x slower |
+--------------------------+---------------+------------------------+
| mako | 8.23 ms | 9.23 ms: 1.12x slower |
+--------------------------+---------------+------------------------+
| mdp | 800 ms | 910 ms: 1.14x slower |
+--------------------------+---------------+------------------------+
| meteor_contest | 71.3 ms | 74.4 ms: 1.04x slower |
+--------------------------+---------------+------------------------+
| nbody | 82.0 ms | 96.6 ms: 1.18x slower |
+--------------------------+---------------+------------------------+
| nqueens | 62.3 ms | 67.5 ms: 1.08x slower |
+--------------------------+---------------+------------------------+
| pathlib | 22.4 ms | 23.1 ms: 1.03x slower |
+--------------------------+---------------+------------------------+
| pickle_dict | 21.4 us | 21.0 us: 1.02x faster |
+--------------------------+---------------+------------------------+
| pickle_pure_python | 221 us | 250 us: 1.13x slower |
+--------------------------+---------------+------------------------+
| pprint_safe_repr | 514 ms | 573 ms: 1.11x slower |
+--------------------------+---------------+------------------------+
| pprint_pformat | 1.05 sec | 1.15 sec: 1.10x slower |
+--------------------------+---------------+------------------------+
| pyflate | 355 ms | 414 ms: 1.17x slower |
+--------------------------+---------------+------------------------+
| python_startup | 22.7 ms | 24.2 ms: 1.07x slower |
+--------------------------+---------------+------------------------+
| python_startup_no_site | 19.1 ms | 20.2 ms: 1.06x slower |
+--------------------------+---------------+------------------------+
| raytrace | 194 ms | 231 ms: 1.19x slower |
+--------------------------+---------------+------------------------+
| regex_compile | 78.9 ms | 89.6 ms: 1.14x slower |
+--------------------------+---------------+------------------------+
| regex_dna | 131 ms | 122 ms: 1.07x faster |
+--------------------------+---------------+------------------------+
| regex_effbot | 2.26 ms | 2.19 ms: 1.03x faster |
+--------------------------+---------------+------------------------+
| richards | 34.4 ms | 41.6 ms: 1.21x slower |
+--------------------------+---------------+------------------------+
| richards_super | 38.9 ms | 46.7 ms: 1.20x slower |
+--------------------------+---------------+------------------------+
| scimark_fft | 209 ms | 215 ms: 1.03x slower |
+--------------------------+---------------+------------------------+
| scimark_lu | 78.7 ms | 93.3 ms: 1.18x slower |
+--------------------------+---------------+------------------------+
| scimark_monte_carlo | 49.1 ms | 54.4 ms: 1.11x slower |
+--------------------------+---------------+------------------------+
| scimark_sor | 95.9 ms | 107 ms: 1.11x slower |
+--------------------------+---------------+------------------------+
| scimark_sparse_mat_mult | 3.09 ms | 3.19 ms: 1.03x slower |
+--------------------------+---------------+------------------------+
| spectral_norm | 75.3 ms | 81.3 ms: 1.08x slower |
+--------------------------+---------------+------------------------+
| sqlglot_normalize | 180 ms | 202 ms: 1.12x slower |
+--------------------------+---------------+------------------------+
| sqlglot_optimize | 33.9 ms | 36.5 ms: 1.08x slower |
+--------------------------+---------------+------------------------+
| sqlglot_parse | 891 us | 1.01 ms: 1.14x slower |
+--------------------------+---------------+------------------------+
| sqlglot_transpile | 1.08 ms | 1.19 ms: 1.11x slower |
+--------------------------+---------------+------------------------+
| sqlite_synth | 1.80 us | 1.83 us: 1.01x slower |
+--------------------------+---------------+------------------------+
| telco | 4.28 ms | 4.56 ms: 1.06x slower |
+--------------------------+---------------+------------------------+
| tomli_loads | 1.65 sec | 1.74 sec: 1.05x slower |
+--------------------------+---------------+------------------------+
| typing_runtime_protocols | 107 us | 114 us: 1.07x slower |
+--------------------------+---------------+------------------------+
| unpack_sequence | 51.5 ns | 58.5 ns: 1.14x slower |
+--------------------------+---------------+------------------------+
| unpickle | 8.47 us | 8.83 us: 1.04x slower |
+--------------------------+---------------+------------------------+
| unpickle_list | 3.34 us | 3.28 us: 1.02x faster |
+--------------------------+---------------+------------------------+
| unpickle_pure_python | 157 us | 192 us: 1.22x slower |
+--------------------------+---------------+------------------------+
| xml_etree_parse | 95.2 ms | 102 ms: 1.07x slower |
+--------------------------+---------------+------------------------+
| xml_etree_iterparse | 66.1 ms | 71.5 ms: 1.08x slower |
+--------------------------+---------------+------------------------+
| xml_etree_generate | 57.0 ms | 63.5 ms: 1.11x slower |
+--------------------------+---------------+------------------------+
| xml_etree_process | 40.8 ms | 46.6 ms: 1.14x slower |
+--------------------------+---------------+------------------------+
| Geometric mean | (ref) | 1.10x slower |
+--------------------------+---------------+------------------------+
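The "Geometric mean" rows can be reproduced from the per-benchmark timings. Here is a minimal sketch using the 'template' tag, with timings copied from the table above; the dictionary and variable names are illustrative.

```python
from statistics import geometric_mean

# (local, release) timings in ms for the 'template' tag, copied from the table above.
template_ms = {
    "genshi_text": (15.7, 18.2),
    "genshi_xml": (34.4, 37.9),
    "mako": (8.23, 9.23),
}

# Ratio > 1 means the release binary is slower than the local build.
ratios = {name: release / local for name, (local, release) in template_ms.items()}
mean_ratio = geometric_mean(list(ratios.values()))

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
print(f"geometric mean: {mean_ratio:.2f}x")
```

This gives about 1.13x, matching the table. For tags with hidden "not significant" benchmarks the reported mean appears to include those hidden results as well, so recomputing from the visible rows alone may not match.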

@zooba
Member

zooba commented Jun 6, 2025

We build on Azure Pipelines infrastructure, which typically matches the GitHub Actions images (with a couple of days of eventual consistency as they deploy updates). We don't really want to maintain custom infrastructure for builds, though it does mean that updates arrive a bit slower.

Though having said that, back when we did maintain custom infrastructure, the updates arrived even slower. So we've improved by moving to shared images.

As a general rule, our binary distribution is useful and functional, but if you have special needs then you're more than welcome to build from source. You can get even more performance benefits for your scenario if you're willing to invest a little, so it really comes down to how much of that effort you want to try and persuade volunteers to do for you instead of doing it yourself.

4 participants