Python3.13 performance Issue with python.org macOS installers on ARM Macs #122580
Comments
@devdanzin and I have figured out that this happens only on ARM; Windows and Linux x86 seemed to be faster based on the benchmarks. |
Thanks for this report. Across a broad range of benchmarks measuring on the Faster CPython benchmarking machine, we see between a 3% and 8% speedup for 3.13.0b4 vs. 3.12.0 on a MacOS ARM machine. That doesn't mean that this particular example hasn't found a narrow regression case. In our own benchmarking, some specific benchmarks are in fact slower, but only I also noticed that you have |
I just tried to reproduce the above result on our benchmarking M1 Mac, also running macOS 14.5, and got different results:
I suspect this comes down to different configurations between the builds. It looks like 3.12.0 on your machine was built on a different machine? Try building each in exactly the same environment and see if you can still reproduce. |
I am using 3.12.4, not 3.12.0. All the versions installed on my machine are the default ones from Python.org. |
I also can't reproduce with local optimized builds on an M2 Pro running Sonoma 14.5 (3 runs each):
However, I can reproduce with the python.org downloads:
So the issue seems to be with the releases themselves. @ned-deily, anything different with macOS releases this time around? |
So, this is kinda strange. Consider this reproducer:
import time

s = "X" * 100_000_000

start = time.perf_counter()
for _ in s:
    pass
duration = time.perf_counter() - start
print(f"global: {duration:.3f}s")

def foo():
    for _ in s:
        pass

start = time.perf_counter()
foo()
duration = time.perf_counter() - start
print(f"function: {duration:.3f}s")
With the 3.12.4 python.org download, this prints:
On 3.13.0rc1:
So the loop over the string is faster at module level on 3.13, but faster at function level in 3.12. And 3.13 is faster on both if I replace |
The issue could be some sort of messed-up (or skipped) PGO with the release build. The global version has lots of But at this point I'm just guessing. |
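For anyone re-running the reproducer above, here is a minimal sketch that times both variants several times and keeps the fastest run, which tends to be less noisy than a single measurement. The names `best_of`, `run_module_level`, and `run_function_level` are made up for this illustration, and the string is deliberately much shorter than the 100_000_000 characters used in the thread. The module-level case is emulated by executing code compiled in "exec" mode, which is an assumption about what makes the two cases differ, not a confirmed diagnosis.

```python
import time

N = 1_000_000  # assumption: far smaller than the thread's 100_000_000, to keep runs short
s = "X" * N

# Compiled in "exec" mode, so the loop variable is stored by name in a
# namespace dict, as it is for code running at module level.
MODULE_LEVEL = compile("for _ in s:\n    pass\n", "<module-level>", "exec")


def run_module_level():
    exec(MODULE_LEVEL, {"s": s})


def run_function_level():
    # Here the loop variable is a fast local, as in foo() in the reproducer.
    for _ in s:
        pass


def best_of(fn, repeats=5):
    """Return the fastest wall-clock time over several runs of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


print(f"global:   {best_of(run_module_level):.4f}s")
print(f"function: {best_of(run_function_level):.4f}s")
```

Running the same file under each installed interpreter gives directly comparable numbers.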
Just dittoing this. I agree, there may be something that changed in the build environment between releases. |
I can reproduce a similar slower performance between the python.org 3.12.4 and 3.13.0rc1 builds on an M3 MacBook Pro with macOS 14.6. However, using the same python.org installers on an Intel i9 iMac also with macOS 14.6 shows a performance improvement between 3.12.4 and 3.13.0rc1. These are both universal2 builds meaning they should be running native arm or x86_64 code on each. And, interestingly, if I force running the universal2 builds in x86_64 mode using Rosetta2 emulation on the same M3 (by using python3.1x-intel64), I see performance differences similar to running natively on the Intel Mac, that is 3.13.0rc1 is somewhat faster. There are a number of differences in the build environments used for 3.12.4 and for 3.13.0rc1, including the compiler versions and deployment targets. I agree that something with optimizations seems a likely culprit. I will dig into it and report back. Thanks for the report and the simple reproducer, @hg0428! |
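When comparing native and Rosetta 2 runs of a universal2 build, it helps to confirm which architecture the interpreter is actually executing as. A small sketch using only the standard library follows; `describe_interpreter` is a hypothetical helper name. On an Apple Silicon Mac, `platform.machine()` would be expected to report `arm64` for a native run and `x86_64` under emulation, though that is worth verifying on your own machine.

```python
import platform
import sys


def describe_interpreter():
    """Collect the version and architecture of the running interpreter."""
    return {
        "version": sys.version.split()[0],
        "machine": platform.machine(),
        "platform": platform.platform(),
    }


for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```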
A quick update: I've run the test case on the full range of python.org macOS installers from 3.13.0a5 through 3.13.0rc1 on both Apple Silicon (M3, arm64) and Intel (i9) Macs and it appears that rc1 on Apple Silicon is the outlier. I also ran the test case on all of the macOS installers where we also provided a free-threading build, that is, 3.13.0b2 through 3.13.0rc1, again, on both CPU types; there, the test-case performance differences roughly mirror those of the normal (GIL) builds except that the rc1 free-threading version does not exhibit the performance degradation that the normal version does on Apple Silicon. So, the good news at this point is that this particular performance issue appears to be isolated but there is more work to be done to pin it down further. More tomorrow. |
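Running one test case across many installed versions can be scripted. The following is a rough sketch, not the procedure actually used here: `SNIPPET`, `time_with`, and `INTERPRETERS` are names invented for the example, and the list only contains the current interpreter, so paths to other installed versions would have to be added by hand.

```python
import subprocess
import sys

# A shortened version of the loop from the reproducer; prints elapsed seconds.
SNIPPET = """
import time
s = "X" * 1_000_000
start = time.perf_counter()
for _ in s:
    pass
print(time.perf_counter() - start)
"""


def time_with(interpreter):
    """Run SNIPPET under the given interpreter and return its reported time."""
    result = subprocess.run(
        [interpreter, "-c", SNIPPET],
        capture_output=True,
        text=True,
        check=True,
    )
    return float(result.stdout.strip())


# Assumption: only the running interpreter is listed; append the paths of
# other installed versions to compare them.
INTERPRETERS = [sys.executable]

for exe in INTERPRETERS:
    print(f"{exe}: {time_with(exe):.4f}s")
```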
Thanks for looking at this, @ned-deily. I'm going to run the entire pyperformance suite against the python.org builds of 3.12.4 and 3.13.0rc1 on our lab machine so we'll have something broader than just the example in the OP. This will take a couple of hours and I'll report back. |
The results are in. The 3.13.0rc1 python.org build is about 9% slower overall than the 3.12.4 python.org build. The results for most benchmarks are slower, so I don't think this is an isolated "worst case" in the OP here. This doesn't really get at the "how to fix", but may at least provide some clues as to what is happening. I put the detailed results in a gist. As a point of comparison, here is 3.13.0rc1 vs. 3.12.0 where each binary was built on the same machine, compiler and build environment. Maybe the different ways in which specific benchmarks move may provide some clues as to the issue. EDIT: Fixed link |
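An "overall" figure across a benchmark suite is commonly summarized as the geometric mean of the per-benchmark time ratios. Assuming that is how the number above was derived, the calculation looks roughly like this. The benchmark names and ratios below are hypothetical placeholders, not values from the gist, and `overall_ratio` is a name made up for the sketch.

```python
from statistics import geometric_mean


def overall_ratio(ratios):
    """Geometric mean of new_time / old_time ratios, one per benchmark."""
    return geometric_mean(list(ratios.values()))


# Hypothetical ratios for illustration only; values above 1.0 mean slower.
hypothetical = {"bench_a": 1.12, "bench_b": 1.05, "bench_c": 0.98}

print(f"overall: {overall_ratio(hypothetical):.3f}x the baseline time")
```

The geometric mean is used instead of the arithmetic mean so that a 2x slowdown in one benchmark and a 2x speedup in another cancel out.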
This is a 404. |
Sorry. I fixed the link above. It should be: https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20240731-3.13.0rc1-e4a3e78/bm-20240731-darwin-arm64-python-v3.13.0rc1-3.13.0rc1-e4a3e78-vs-3.12.0.svg |
Out of curiosity, is the plan to update the builds on python.org, or just fix it with the next RC? |
TBD at this point |
Which architecture does the build machine for the installer use? This might be PGO related (as @brandtbucher noted) if the build machine is an Intel Mac. It is not clear to me if a |
TL;DR The performance degradation demonstrated by this test case unexpectedly seems to be due to the use of Longer term, we should try to pin down what the issue is with using LTO in this environment. Fortunately, this particular performance problem is quite easy to reproduce. Initially, due to the somewhat complex process needed to build the installer to meet both our and Apple's requirements for distribution, i.e. one build that natively supports a very broad range of Mac models and macOS releases, I suspected that the performance problem reported was most likely due to one of those special requirements: framework build, universal (multi-architecture) binaries, hardened runtime, codesigning, installer package notarization, etc. However, I was eventually able to eliminate all of those factors ending up with comparing three very simple configurations, essentially:
A summary of the results of running the test case with
macOS 14.6 Build 23G80 was used on real and virtual machines on both Apple Silicon (M3) and Intel (i9) Macs. The build tools (clang, ld64, etc.) supplied with Xcode 15.4, Build 15F31d were used throughout building natively (either arm64 or x86_64) for 1, 2, and 3 and as a universal2 build (arm64 and x86_64 fat binaries) on the Apple Silicon Mac for the installers. Various other builds of earlier 3.13 pre-releases and 3.12.4 were made and tested in some of the same environments and with an older version of the Xcode tools on an older version of macOS. Also, published installers for previous 3.13 pre-releases were spot-tested. In general, the results often, but not always, showed an improvement when not using LTO. But none were as dramatic as the 3.13.0rc1 results. Why that should be is an interesting open question. Also, interesting is the much wider range of results on Apple Silicon Macs compared with Intel Macs. As to why this particular apparent performance regression went unnoticed, our focus in providing the installers is to support as wide a range of use cases as practical, including natively supporting all Macs that are supported with a wide range of macOS operating system releases, for 3.13, macOS 10.13 and newer (for 3.12.x, macOS 10.9 on). That also means that performance, while important, is not the highest priority for installer releases as supporting a wide range of systems involves tradeoffs. In other words, if performance on macOS is of critical importance, you probably should be looking at a build tailored to your specific environment and requirements. So, with our limited resources, we depend to a large extent on external testing and feedback with regard to performance. In this case, though, it is surprising that this regression came to the fore with rc1 as there were no significant differences between recent betas and rc1 as to the installer build process. 
A big change occurred between 3.13.0b1 and b2 when installer builds changed from building on an Intel Mac environment with macOS 11 and its Xcode tools supporting macOS 10.9+ to building on an Apple Silicon Mac environment with macOS 14 and its Xcode tools supporting 10.13+. The Xcode build tools support building universal builds on either architecture but, as @ronaldoussoren brought up, that raises the question about what impact the build host architecture might have on PGO and other optimizations. About a year ago, I ran a set of various builds and pyperformance benchmarks on a group of several Macs, both Intel and Apple Silicon, all running the same operating system version. At that time, no significant differences were noted when running universal builds on the opposite architecture which gave some confidence when making the build environment switch at 3.13.0b2. A more important factor was likely the newer clang and related tools in the newer Xcode releases. It would be good to periodically repeat that kind of environment testing but it is of lower priority (though such differences do not appear to be a factor in this particular test case). At this point, I think the next step is to run a full pyperformance suite using the rebuilt installer (which I can make available to test if @mdboom is willing!) and, if the results look good, I will make a recommendation to @Yhg1s as release manager that we replace the rc1 installer along with an announcement. We can then look at further action items to better understand the root cause. |
I'll go ahead and run pyperformance on the new installer, since Mike is out. I'm curious, does the performance only come back with both I'm mostly asking because it feels a bit icky to me to "fix" this regression by just turning off LTO entirely, just because it seems to make the numbers better for some reason. |
AArch64 run finished (x86_64 is still chugging along). Not a particularly quiet machine, but it looks like there's a major improvement with the new installer, from 14% slower (old) to 1% slower (new).
For reference, here are our team's "official" benchmarks comparing 3.13.0rc0 vs. 3.12.0 on this platform last week. |
I'm seeing similar results on an x86_64 Mac (from 4% slower to 8% faster):
|
See also #122832 for some extra benchmarking on the degradation. |
Regarding ARM, do we have benchmarks for Linux aarch64 now? |
Yes, thanks to a gracious donation from ARM, Inc. You can see them here |
I was also able to confirm the improvement on the official builds that @brandtbucher reported above. On that basis, if we needed to cut a release today I think the solution is clear that we just go with the thing that we've measured to be an improvement. However, it is surprising that turning off an optimization (LTO in this case) results in a speedup -- and I worry that by not understanding why, we just open ourselves up to other random changes going forward. As a first step, I thought I would try to reproduce the result on our own hardware / build environment. If I take the same 3.10.0rc1 commit and build it with:
(2) is about 7% slower than (1). This is the opposite of the effect that we see with @ned-deily's official builds, but it's not surprising that turning off LTO would make things slower. (This is all ARM64 as host and target -- I didn't cross compile and I don't have access to any Intel Mac hardware). It might make sense to break off the release blocker (how do we get a good build now) and create a new issue to investigate preventing this going forward. |
Thanks for the additional data points, @mdboom. After a vacation break, I am back working on this, with a resolution prior to next week's 3.13.0rc2. |
Extensive testing with the microbenchmarks provided in this issue and in #122832 confirms that (1) there was a significant performance degradation for at least some test cases with the 3.13.0rc1 provided by the python.org installer, particularly on Apple Silicon Macs, and that (2) that performance degradation appears to be due to the use of the ThinLTO (Link Time Optimization) Python build option ( Since performance can vary greatly due to many factors, it is unreasonable to expect the same level of improvement in more extensive real-world benchmarks but it seems likely that an overall improvement of some magnitude may be seen with most benchmarks.
From reading a bit about ThinLTO, it seems one of its biggest advantages was expected to be providing faster incremental builds rather than doing the monolithic optimization for each build. For the python.org installer case, that isn't an advantage since we always build the binary components of the installer from scratch for each release. And the documentation implies that, in most cases, the monolithic LTO would be expected to provide better run-time performance anyway. So all of that seems to indicate that full LTO is the right choice for the installer build in any case.
Moving forward, with the imminent release of 3.13.0rc2, we will use While there are still unanswered questions about exactly what it is about ThinLTO that causes the performance degradation in these environments, I am not planning to delve deeper into it since it's not something I have particular expertise in and it seems that using full LTO is a perfectly satisfactory solution for the macOS installer. Feel free to dive in yourself!
Thanks again to everyone who has helped out with reporting and investigating this!
Appended here is a summary of the more pertinent test results while investigating this issue:
Test Cases
Apple Silicon (MacBook Pro 14-inch, Nov 2023, M3, 16GB)
Summary
Intel (iMac Retina 5K 27-inch, 2019, 3.6 GHz 8-Core Intel Core i9, 40GB)
Summary
Build configurations
|
Thanks for all your hard work on this, @ned-deily. This finding seems reasonable. I've put the rc2 release date on my calendar and will run it against the full pyperformance suite then to confirm. |
Thanks @ned-deily. I wonder, should the default for In general, if I'm passing |
What's the default LTO (full or thin) for the Linux builds? |
IIUC, thin LTO is only supported on llvm/clang, so our gcc-based Linux builds don't use it. |
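To see which compiler and which LTO flavor a particular interpreter was built with, one option is to scan the build variables recorded by `sysconfig` for `-flto...` flags. This is a sketch under the assumption that the chosen mode shows up as such a flag somewhere in those variables; `lto_flags` is a hypothetical helper, and on platforms without a recorded Makefile (for example Windows) it will most likely find nothing.

```python
import shlex
import sysconfig


def lto_flags():
    """Return the distinct -flto... flags found in the recorded build variables."""
    found = set()
    for value in sysconfig.get_config_vars().values():
        if not isinstance(value, str):
            continue
        try:
            tokens = shlex.split(value)
        except ValueError:
            # Fall back to whitespace splitting if the quoting is unbalanced.
            tokens = value.split()
        found.update(t for t in tokens if t.startswith("-flto"))
    return sorted(found)


print("CC:", sysconfig.get_config_var("CC"))
print("LTO flags:", lto_flags() or "none found")
```

With clang, a flag such as `-flto=thin` would indicate ThinLTO, while a bare `-flto` or `-flto=full` would indicate monolithic LTO.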
The default for |
Just tested 3.13rc2 and the performance issues are solved. Thank you. |
Same (on our usual macOS lab machine). |
Closing as fixed. Thanks for testing, @rhettinger & @mdboom! |
FWIW, 3.13.0rc3 differences noted: #124567 (comment) |
Bug description:
Python3.13rc1 seems to be about 20-25% slower than Python3.12.4.
Platform: M1 Mac running macOS Sonoma 14.5
I first noticed the issue when running this repository where Python3.13rc1 ran on average 20-25% slower than Python3.12.4
With help from the Python discord server, we were able to create this minimal reproducible example:
These were the results across many tests:
3.12.4: 7.742122s
3.13rc1: 8.309326s
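For reference, the relative slowdown implied by these two timings can be worked out directly. A small sketch follows; the variable names are made up, and the two values are the ones reported above.

```python
t_312 = 7.742122  # seconds, 3.12.4 (from the results above)
t_313 = 8.309326  # seconds, 3.13rc1 (from the results above)

# Fractional increase in run time relative to the 3.12.4 baseline.
slowdown = (t_313 - t_312) / t_312

print(f"3.13rc1 is {slowdown:.1%} slower than 3.12.4 on this example")
```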
The performance issue was not reproducible on other operating systems.
CPython versions tested on:
3.11.1, 3.12.4, 3.13
CPython Options:
For my testing, I used the official Python3.13rc1 installer from Python.org, which has the following configuration:
OPT:
3.13rc1:
-DNDEBUG -g -O3 -Wall
3.12.4:
-DNDEBUG -g -O3 -Wall
CONFIG_ARGS:
3.13rc1:
'--enable-framework' '--with-framework-name=Python' '--enable-universalsdk=/' '--with-universal-archs=universal2' '--enable-optimizations' '--with-lto' '--without-ensurepip' '--with-system-libmpdec' '--with-openssl=/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local' 'LIBLZMA_CFLAGS=-I/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/include' 'LIBLZMA_LIBS=-L/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/lib -llzma' 'LIBMPDEC_CFLAGS=-I/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/include' 'LIBMPDEC_LIBS=-L/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/lib -lmpdec -lm' 'LIBSQLITE3_CFLAGS=-I/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/include' 'LIBSQLITE3_LIBS=-L/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/lib -lsqlite3' 'TCLTK_CFLAGS=-I/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/include' 'TCLTK_LIBS=-L/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/lib -ltcl -ltk' 'CC=clang'
3.12.4:
'-C' '--enable-framework' '--enable-universalsdk=/' '--with-universal-archs=universal2' '--with-computed-gotos' '--without-ensurepip' '--with-openssl=/tmp/_py/libraries/usr/local' '--enable-optimizations' '--with-lto' 'TCLTK_CFLAGS=-I/tmp/_py/libraries/usr/local/include' 'TCLTK_LIBS=-ltcl8.6 -ltk8.6' 'LDFLAGS=-g' 'CFLAGS=-g' 'CC=clang'
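The `OPT` and `CONFIG_ARGS` values shown above can be read from any interpreter with the standard `sysconfig` module. The sketch below does that and picks out the optimization-related configure switches; `build_summary` is a hypothetical helper name, and on builds that do not record these variables (for example on Windows) the values will simply be empty or `None`.

```python
import shlex
import sysconfig


def build_summary():
    """Summarize optimization-related build settings of this interpreter."""
    # CONFIG_ARGS is a single shell-quoted string; it may be None on some platforms.
    config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    args = shlex.split(config_args)
    return {
        "OPT": sysconfig.get_config_var("OPT"),
        "pgo": "--enable-optimizations" in args,
        "lto": [a for a in args if a.startswith("--with-lto")],
        "CONFIG_ARGS": args,
    }


summary = build_summary()
print("OPT:", summary["OPT"])
print("PGO requested:", summary["pgo"])
print("LTO switches:", summary["lto"] or "none")
```

Comparing this output between two installs is a quick way to spot configuration differences like the ones listed above.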
Operating systems tested on:
Linux, macOS, Windows