Python3.13 performance Issue with python.org macOS installers on ARM Macs #122580

Closed
hg0428 opened this issue Aug 1, 2024 · 36 comments
Labels: 3.13 (bugs and security fixes), build (the build process and cross-build), OS-mac, performance (performance or resource usage)

Comments

@hg0428

hg0428 commented Aug 1, 2024

Bug description:

Python3.13rc1 seems to be about 20-25% slower than Python3.12.4.
Platform: M1 Mac running macOS Sonoma 14.5

I first noticed the issue when running this repository where Python3.13rc1 ran on average 20-25% slower than Python3.12.4

With help from the Python discord server, we were able to create this minimal reproducible example:

from time import process_time

class Lexer:
    def __init__(self, data):
        self.data = data
        self.index = 0

    def detect(self, text):
        i = 0
        for char in text:
            if self.peek(i) != char:
                return False
            i += 1
        return True

    def peek(self, amt=1):
        if self.index + amt < len(self.data):
            return self.data[self.index + amt]
        else:
            return None

text = "return text + indent * ' ' + '}'\n" * 200
needle = "return text + indent * ' ' + '}'\n" * 199 + "Nope\n" 
lexer = Lexer(text)
start = process_time() 
for x in range(10_000):
    lexer.detect(needle)
print(process_time() - start)

These were the results across many tests:
3.12.4: 7.742122s
3.13rc1: 8.309326s

The performance issue was not reproducible on other Operating Systems.
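Single timings like those above can be noisy. As a rough sketch (the helper name `time_repeats`, the stand-in `workload`, and the repeat count are illustrative assumptions, not part of the original report), the reproducer could be run several times per interpreter, comparing the minimum and median:

```python
from statistics import median
from time import process_time


def time_repeats(func, repeats=5):
    """Call func() `repeats` times and return the CPU time of each call."""
    timings = []
    for _ in range(repeats):
        start = process_time()
        func()
        timings.append(process_time() - start)
    return timings


def workload():
    # Stand-in workload (hypothetical); in practice this would be the
    # `lexer.detect(needle)` loop from the reproducer above.
    total = 0
    for i in range(100_000):
        total += i
    return total


if __name__ == "__main__":
    timings = time_repeats(workload)
    print(f"min: {min(timings):.6f}s  median: {median(timings):.6f}s")
```

Running the same script under each interpreter and comparing the minimums gives a less noisy comparison than a single run.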

CPython versions tested on:

3.11.1, 3.12.4, 3.13

CPython Options:

For my testing, I used the official Python3.13rc1 installer from Python.org, which has the following configuration:

OPT:

3.13rc1: -DNDEBUG -g -O3 -Wall
3.12.4: -DNDEBUG -g -O3 -Wall

CONFIG_ARGS:

3.13rc1: '--enable-framework' '--with-framework-name=Python' '--enable-universalsdk=/' '--with-universal-archs=universal2' '--enable-optimizations' '--with-lto' '--without-ensurepip' '--with-system-libmpdec' '--with-openssl=/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local' 'LIBLZMA_CFLAGS=-I/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/include' 'LIBLZMA_LIBS=-L/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/lib -llzma' 'LIBMPDEC_CFLAGS=-I/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/include' 'LIBMPDEC_LIBS=-L/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/lib -lmpdec -lm' 'LIBSQLITE3_CFLAGS=-I/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/include' 'LIBSQLITE3_LIBS=-L/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/lib -lsqlite3' 'TCLTK_CFLAGS=-I/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/include' 'TCLTK_LIBS=-L/Users/nad/release-tools/macos-installer/installer/variant/binaries/build/libraries/usr/local/lib -ltcl -ltk' 'CC=clang'
3.12.4: '-C' '--enable-framework' '--enable-universalsdk=/' '--with-universal-archs=universal2' '--with-computed-gotos' '--without-ensurepip' '--with-openssl=/tmp/_py/libraries/usr/local' '--enable-optimizations' '--with-lto' 'TCLTK_CFLAGS=-I/tmp/_py/libraries/usr/local/include' 'TCLTK_LIBS=-ltcl8.6 -ltk8.6' 'LDFLAGS=-g' 'CFLAGS=-g' 'CC=clang'

Operating systems tested on:

Linux, macOS, Windows

@hg0428 hg0428 added the type-bug An unexpected behavior, bug, or error label Aug 1, 2024
@mdboom mdboom added the performance Performance or resource usage label Aug 1, 2024
@ZeroIntensity
Member

@devdanzin and I have figured out that this only happens on ARM; Windows and Linux x86 seemed to be faster based on the benchmarks.

@mdboom
Contributor

mdboom commented Aug 1, 2024

Thanks for this report. Across a broad range of benchmarks measuring on the Faster CPython benchmarking machine, we see between a 3% and 8% speedup for 3.13.0b4 vs. 3.12.0 on a macOS ARM machine.

That doesn't mean that this particular example hasn't found a narrow regression case. In our own benchmarking, some specific benchmarks are in fact slower, but only telco is that much slower (which is a known regression in the decimal module). Maybe some line profiling of 3.12 vs. 3.13 would be helpful here, or some pystats results.

I also noticed that you have --with-computed-gotos for 3.12, but not 3.13 here. That may not be relevant since it's on by default anyway (and I'm not sure it even applies for clang), but we should make sure all other variables are equal.
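To check that "all other variables are equal", the configure arguments of each interpreter can be read from `sysconfig`. The sketch below is only an illustration: the helper names `configure_flags` and `flag_diff` are made up here, and `CONFIG_ARGS` may be unset on some platforms (for example Windows), which the code treats as an empty list.

```python
import shlex
import sysconfig


def configure_flags():
    """Return the set of ./configure arguments this interpreter was built with."""
    raw = sysconfig.get_config_var("CONFIG_ARGS") or ""
    # CONFIG_ARGS is a single shell-quoted string; split it into arguments.
    return set(shlex.split(raw))


def flag_diff(flags_a, flags_b):
    """Return (only_in_a, only_in_b) as sorted lists."""
    return sorted(flags_a - flags_b), sorted(flags_b - flags_a)


if __name__ == "__main__":
    for flag in sorted(configure_flags()):
        print(flag)
    print("OPT:", sysconfig.get_config_var("OPT"))
```

Saving the output from each interpreter and passing the two sets to `flag_diff` would surface differences such as `--with-computed-gotos` appearing in only one build.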

@mdboom
Contributor

mdboom commented Aug 1, 2024

I just tried to reproduce the above result on our benchmarking M1 Mac, also running macOS 14.5, and got different results:

3.12.0: 7.471261
3.13.0rc1: 5.986742

I suspect this comes down to different configurations between the builds. It looks like 3.12.0 on your machine was built on a different machine? Try building each in exactly the same environment and see if you can still reproduce.

@hg0428
Author

hg0428 commented Aug 1, 2024

> I just tried to reproduce the above result on our benchmarking M1 Mac, also running macOS 14.5, and got different results:
>
> 3.12.0: 7.471261
> 3.13.0rc1: 5.986742
>
> I suspect this comes down to different configurations between the builds. It looks like 3.12.0 on your machine was built on a different machine? Try building each in exactly the same environment and see if you can still reproduce.

I am using 3.12.4, not 3.12.0.
I have tested 3.11, 3.12.1, 3.12.3, all from Python.org at different points in time.
I have also tested many different betas of 3.13, and the results have been consistent.

All the versions installed on my machine are the default ones from Python.org.

@brandtbucher
Member

I also can't reproduce with local optimized builds on an M2 Pro running Sonoma 14.5 (3 runs each):

v3.12.4:          5.718666, 6.035776, 5.577658
v3.13.0rc1:       4.933225, 5.214851, 4.837962
v3.13.0rc1 (JIT): 4.746776, 4.747096, 4.764923

However, I can reproduce with the python.org downloads:

3.12.4:           5.684782, 5.808816, 5.734691
3.13.0rc1:        6.673098, 6.930707, 7.369320

So the issue seems to be with the releases themselves. @ned-deily, anything different with macOS releases this time around?

@brandtbucher
Member

So, this is kinda strange. Consider this reproducer:

import time

s = "X" * 100_000_000

start = time.perf_counter()
for _ in s:
    pass
duration = time.perf_counter() - start
print(f"global: {duration:.3f}s")

def foo():
    for _ in s:
        pass

start = time.perf_counter()
foo()
duration = time.perf_counter() - start
print(f"function: {duration:.3f}s")

With the 3.12.4 python.org download, this prints:

global: 1.237s
function: 0.364s

On 3.13.0rc1:

global: 0.930s
function: 0.680s

So the loop over the string is faster at module level on 3.13, but faster at function level in 3.12. And 3.13 is faster on both if I replace "X" * 100_000_000 with range(100_000_000).

@brandtbucher brandtbucher added OS-mac 3.13 bugs and security fixes labels Aug 1, 2024
@brandtbucher
Member

brandtbucher commented Aug 1, 2024

The issue could be some sort of messed-up (or skipped) PGO with the release build. The global version has lots of FOR_ITER -> STORE_NAME pairs while the function version has lots of FOR_ITER -> STORE_FAST pairs. That's really the only difference that I can see. I'd expect PGO to make the latter very fast, at the expense of the former.

But at this point I'm just guessing.
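The bytecode difference described above can be inspected with the standard `dis` module. This is a minimal sketch: the source strings mirror the reproducer's two loops, and the helper name `opnames` is introduced here for illustration only.

```python
import dis

LOOP_SOURCE = "for _ in s:\n    pass\n"
FUNC_SOURCE = "def foo():\n    for _ in s:\n        pass\n"


def opnames(code):
    """Return the opcode names in a code object, in order."""
    return [instr.opname for instr in dis.get_instructions(code)]


# Module-level loop: the loop variable is stored with STORE_NAME.
module_code = compile(LOOP_SOURCE, "<module-loop>", "exec")

# Function-level loop: the loop variable is a fast local.
namespace = {}
exec(compile(FUNC_SOURCE, "<func-loop>", "exec"), namespace)
function_code = namespace["foo"].__code__

if __name__ == "__main__":
    print("module:  ", opnames(module_code))
    print("function:", opnames(function_code))
```

The exact opcode lists vary between CPython versions, so the output is best compared side by side on the interpreters under test.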

@mdboom
Contributor

mdboom commented Aug 1, 2024

> However, I can reproduce with the python.org downloads:

Just dittoing this. I agree, there may be something that changed in the build environment between releases.

@ned-deily
Member

ned-deily commented Aug 1, 2024

I can reproduce a similar slower performance between the python.org 3.12.4 and 3.13.0rc1 builds on an M3 MacBook Pro with macOS 14.6. However, using the same python.org installers on an Intel i9 iMac also with macOS 14.6 shows a performance improvement between 3.12.4 and 3.13.0rc1. These are both universal2 builds meaning they should be running native arm or x86_64 code on each. And, interestingly, if I force running the universal2 builds in x86_64 mode using Rosetta2 emulation on the same M3 (by using python3.1x-intel64), I see performance differences similar to running natively on the Intel Mac, that is 3.13.0rc1 is somewhat faster.
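When comparing native and Rosetta 2 runs of a universal2 build, it helps to record which architecture each run actually used. A small sketch follows; the function name `describe_interpreter` is hypothetical, and the expectation that `platform.machine()` reports `x86_64` under Rosetta 2 is an assumption worth verifying on the machine in question.

```python
import platform


def describe_interpreter():
    """Return a one-line summary of the running interpreter and its architecture."""
    # platform.machine() reports the architecture seen by this process,
    # e.g. "arm64" natively on Apple Silicon or "x86_64" on Intel.
    return (
        f"Python {platform.python_version()} "
        f"on {platform.system()} ({platform.machine()})"
    )


if __name__ == "__main__":
    print(describe_interpreter())
```

Printing this line at the top of each benchmark log makes it harder to mix up results from the two modes.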

There are a number of differences in the build environments used for 3.12.4 and for 3.13.0rc1, including the compiler versions and deployment targets. I agree that something with optimizations seems a likely culprit. I will dig into it and report back.

Thanks for the report and the simple reproducer, @hg0428!

@ned-deily ned-deily self-assigned this Aug 1, 2024
@ned-deily
Member

ned-deily commented Aug 2, 2024

A quick update: I've run the test case on the full range of python.org macOS installers from 3.13.0a5 through 3.13.0rc1 on both Apple Silicon (M3, arm64) and Intel (i9) Macs and it appears that rc1 on Apple Silicon is the outlier. I also ran the test case on all of the macOS installers where we also provided a free-threading build, that is, 3.13.0b2 through 3.13.0rc1, again, on both CPU types; there, the test-case performance differences roughly mirror those of the normal (GIL) builds except that the rc1 free-threading version does not exhibit the performance degradation that the normal version does on Apple Silicon. So, the good news at this point is that this particular performance issue appears to be isolated but there is more work to be done to pin it down further. More tomorrow.

@mdboom
Contributor

mdboom commented Aug 2, 2024

Thanks for looking at this, @ned-deily.

I'm going to run the entire pyperformance suite against the python.org builds of 3.12.4 and 3.13.0rc1 on our lab machine so we'll have something broader than just the example in the OP. This will take a couple of hours and I'll report back.

@mdboom
Contributor

mdboom commented Aug 2, 2024

The results are in. The 3.13.0rc1 python.org build is about 9% slower overall than the 3.12.4 python.org build. The results for most benchmarks are slower, so I don't think this is an isolated "worst case" in the OP here. This doesn't really get at the "how to fix", but may at least provide some clues as to what is happening.

I put the detailed results in a gist.

As a point of comparison, here is 3.13.0rc1 vs. 3.12.0 where each binary was built on the same machine, compiler and build environment. Maybe the different ways in which specific benchmarks move may provide some clues as to the issue.

EDIT: Fixed link

@hugovk
Member

hugovk commented Aug 2, 2024

> here is 3.13.0rc1 vs. 3.12.0

This is a 404.

@mdboom
Contributor

mdboom commented Aug 2, 2024

> > here is 3.13.0rc1 vs. 3.12.0
>
> This is a 404.

Sorry. I fixed the link above. It should be: https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20240731-3.13.0rc1-e4a3e78/bm-20240731-darwin-arm64-python-v3.13.0rc1-3.13.0rc1-e4a3e78-vs-3.12.0.svg

@ZeroIntensity
Member

Out of curiosity, is the plan to update the builds on python.org, or just fix it with the next RC?

@ned-deily
Member

> is the plan to update the builds on python.org, or just fix it with the next RC?

TBD at this point

@ronaldoussoren
Contributor

> A quick update: I've run the test case on the full range of python.org macOS installers from 3.13.0a5 through 3.13.0rc1 on both Apple Silicon (M3, arm64) and Intel (i9) Macs and it appears that rc1 on Apple Silicon is the outlier. I also ran the test case on all of the macOS installers where we also provided a free-threading build, that is, 3.13.0b2 through 3.13.0rc1, again, on both CPU types; there, the test-case performance differences roughly mirror those of the normal (GIL) builds except that the rc1 free-threading version does not exhibit the performance degradation that the normal version does on Apple Silicon. So, the good news at this point is that this particular performance issue appears to be isolated but there is more work to be done to pin it down further. More tomorrow.

Which architecture does the build machine for the installer use? This might be PGO related (as @brandtbucher noted) if the build machine is an Intel Mac.

It is not clear to me if a --enable-universalsdk PGO build can use profile information for both architectures when only collecting that information for one architecture.

@ned-deily
Member

TL;DR The performance degradation demonstrated by this test case unexpectedly seems to be due to the use of --with-lto(=thin) as seen with the current version of the Apple-supplied clang/llvm-based build tools. The fix for the problem is to avoid using the --with-lto configure option when producing our Python for macOS installers. Rebuilding the 3.13.0rc1 installer with exactly the same source and build environment except for deleting --with-lto and adding --with-computed-gotos resulted in speedups of this test case of up to 40% over the released 3.13.0rc1 installer on one Apple Silicon Mac and about 4.6% on one Intel-based Mac. While this test case is likely an outlier, there is other anecdotal evidence of some performance improvements in other tests. I propose that the rebuilt installer be subjected to the full pyperformance suite and, assuming that the results are judged to be an overall improvement and with the release manager's approval, that we replace the download file on python.org and make an announcement of the updated installer.

Longer term, we should try to pin down what the issue is with using LTO in this environment. Fortunately, this particular performance problem is quite easy to reproduce. Initially, due to the somewhat complex process needed to build the installer to meet both our and Apple's requirements for distribution, i.e. one build that natively supports a very broad range of Mac models and macOS releases, I suspected that the performance problem reported was most likely due to one of those special requirements: framework build, universal (multi-architecture) binaries, hardened runtime, codesigning, installer package notarization, etc. However, I was eventually able to eliminate all of those factors ending up with comparing three very simple configurations, essentially:

# 1. no optimizations
./configure --with-computed-gotos && make
# 2. optimizations and thin LTO
./configure --with-computed-gotos --enable-optimizations --with-lto && make
# 3. optimizations with no LTO
./configure --with-computed-gotos --enable-optimizations && make

A summary of the results of running the test case with v3.13.0rc1:

Build type            M3 Mac (arm64)      Intel Mac (x86_64)
==================    ================    ================
1 - no opt            4.80, 4.78, 4.85    8.69, 8.68, 8.90
2 - opt with LTO      5.97, 6.11, 6.26    8.91, 9.00, 8.99
3 - opt, no LTO       4.56, 4.47, 5.06    8.39, 8.36, 8.38

released installer    6.67, 6.49, 6.59    8.87, 8.93, 8.88
rebuilt installer     4.48, 4.49, 4.44    8.65, 8.54, 8.54

macOS 14.6 Build 23G80 was used on real and virtual machines on both Apple Silicon (M3) and Intel (i9) Macs. The build tools (clang, ld64, etc.) supplied with Xcode 15.4, Build 15F31d were used throughout, building natively (either arm64 or x86_64) for 1, 2, and 3 and as a universal2 build (arm64 and x86_64 fat binaries) on the Apple Silicon Mac for the installers. The released installer was built with --enable-optimizations and --with-lto, while the rebuilt installer was built with --enable-optimizations and --with-computed-gotos.
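As a worked example of how the installer rows in the table above translate into a speedup ratio, here is a small sketch using the mean of the three runs. The choice of the mean (rather than, say, best or worst run) is an assumption, so the result need not match the percentages quoted in the TL;DR exactly; the names `TIMINGS` and `speedup` are illustrative.

```python
from statistics import mean

# Timings in seconds, copied from the installer rows of the table above.
TIMINGS = {
    "M3 Mac (arm64)": {
        "released": [6.67, 6.49, 6.59],
        "rebuilt": [4.48, 4.49, 4.44],
    },
    "Intel Mac (x86_64)": {
        "released": [8.87, 8.93, 8.88],
        "rebuilt": [8.65, 8.54, 8.54],
    },
}


def speedup(released, rebuilt):
    """Ratio of mean run times; above 1.0 means the rebuilt installer is faster."""
    return mean(released) / mean(rebuilt)


if __name__ == "__main__":
    for machine, runs in TIMINGS.items():
        ratio = speedup(runs["released"], runs["rebuilt"])
        print(f"{machine}: {ratio:.2f}x")
```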

Various other builds of earlier 3.13 pre-releases and 3.12.4 were made and tested in some of the same environments and with an older version of the Xcode tools on an older version of macOS. Also, published installers for previous 3.13 pre-releases were spot-tested. In general, the results often, but not always, showed an improvement when not using LTO. But none were as dramatic as the 3.13.0rc1 results. Why that should be is an interesting open question. Also, interesting is the much wider range of results on Apple Silicon Macs compared with Intel Macs.

As to why this particular apparent performance regression went unnoticed, our focus in providing the installers is to support as wide a range of use cases as practical, including natively supporting all Macs that are supported with a wide range of macOS operating system releases, for 3.13, macOS 10.13 and newer (for 3.12.x, macOS 10.9 on). That also means that performance, while important, is not the highest priority for installer releases as supporting a wide range of systems involves tradeoffs. In other words, if performance on macOS is of critical importance, you probably should be looking at a build tailored to your specific environment and requirements. So, with our limited resources, we depend to a large extent on external testing and feedback with regard to performance.

In this case, though, it is surprising that this regression came to the fore with rc1 as there were no significant differences between recent betas and rc1 as to the installer build process. A big change occurred between 3.13.0b1 and b2 when installer builds changed from building on an Intel Mac environment with macOS 11 and its Xcode tools supporting macOS 10.9+ to building on an Apple Silicon Mac environment with macOS 14 and its Xcode tools supporting 10.13+. The Xcode build tools support building universal builds on either architecture but, as @ronaldoussoren brought up, that raises the question about what impact the build host architecture might have on PGO and other optimizations. About a year ago, I ran a set of various builds and pyperformance benchmarks on a group of several Macs, both Intel and Apple Silicon, all running the same operating system version. At that time, no significant differences were noted when running universal builds on the opposite architecture which gave some confidence when making the build environment switch at 3.13.0b2. A more important factor was likely the newer clang and related tools in the newer Xcode releases. It would be good to periodically repeat that kind of environment testing but it is of lower priority (though such differences do not appear to be a factor in this particular test case).

At this point, I think the next step is to run a full pyperformance suite using the rebuilt installer (which I can make available to test if @mdboom is willing!) and, if the results look good, I will make a recommendation to @Yhg1s as release manager that we replace the rc1 installer along with an announcement. We can then look at further action items to better understand the root cause.

@brandtbucher
Member

I'll go ahead and run pyperformance on the new installer, since Mike is out.

I'm curious, does the performance only come back with both --with-lto removed and --with-computed-gotos added, or did you also try just adding back --with-computed-gotos? Was there a reason it was removed in the first place?

I'm mostly asking because it feels a bit icky to me to "fix" this regression by just turning off LTO entirely, just because it seems to make the numbers better for some reason.

@brandtbucher
Member

AArch64 run finished (x86_64 is still chugging along).

Not a particularly quiet machine, but it looks like there's a major improvement with the new installer, from 14% slower (old) to 1% slower (new).

| Benchmark | 3.12.4 | 3.13.0rc1 (old) | 3.13.0rc1 (new) |
|---|---|---|---|
| 2to3 | 158 ms | 186 ms: 1.18x slower | 163 ms: 1.03x slower |
| async_generators | 279 ms | 294 ms: 1.05x slower | 297 ms: 1.06x slower |
| async_tree_none | 225 ms | 216 ms: 1.04x faster | 194 ms: 1.16x faster |
| async_tree_cpu_io_mixed | 458 ms | 447 ms: 1.03x faster | 417 ms: 1.10x faster |
| async_tree_cpu_io_mixed_tg | 464 ms | 433 ms: 1.07x faster | 411 ms: 1.13x faster |
| async_tree_eager | 61.8 ms | 76.1 ms: 1.23x slower | 64.5 ms: 1.04x slower |
| async_tree_eager_cpu_io_mixed | 333 ms | 347 ms: 1.04x slower | 329 ms: 1.01x faster |
| async_tree_eager_cpu_io_mixed_tg | 316 ms | not significant | 303 ms: 1.05x faster |
| async_tree_eager_io | 645 ms | 705 ms: 1.09x slower | not significant |
| async_tree_eager_io_tg | 579 ms | 707 ms: 1.22x slower | 666 ms: 1.15x slower |
| async_tree_eager_memoization | 156 ms | 164 ms: 1.05x slower | 148 ms: 1.05x faster |
| async_tree_eager_memoization_tg | 131 ms | 137 ms: 1.05x slower | 125 ms: 1.05x faster |
| async_tree_eager_tg | 42.8 ms | 51.8 ms: 1.21x slower | 43.4 ms: 1.01x slower |
| async_tree_io | 572 ms | 549 ms: 1.04x faster | 506 ms: 1.13x faster |
| async_tree_io_tg | 582 ms | 553 ms: 1.05x faster | 500 ms: 1.16x faster |
| async_tree_memoization | 268 ms | 263 ms: 1.02x faster | 236 ms: 1.13x faster |
| async_tree_memoization_tg | 275 ms | 262 ms: 1.05x faster | 226 ms: 1.22x faster |
| async_tree_none_tg | 216 ms | 197 ms: 1.10x faster | 177 ms: 1.21x faster |
| asyncio_tcp | 347 ms | not significant | 318 ms: 1.09x faster |
| asyncio_websockets | 383 ms | 381 ms: 1.00x faster | 382 ms: 1.00x faster |
| chameleon | 4.14 ms | 6.04 ms: 1.46x slower | 4.28 ms: 1.03x slower |
| chaos | 38.8 ms | 45.2 ms: 1.16x slower | 35.0 ms: 1.11x faster |
| comprehensions | 10.4 us | 12.7 us: 1.23x slower | 9.97 us: 1.04x faster |
| bench_mp_pool | 45.5 ms | 44.5 ms: 1.02x faster | 44.7 ms: 1.02x faster |
| bench_thread_pool | 487 us | 566 us: 1.16x slower | 514 us: 1.05x slower |
| coroutines | 15.4 ms | 19.7 ms: 1.28x slower | 15.2 ms: 1.01x faster |
| coverage | 31.5 ms | 56.2 ms: 1.78x slower | 55.6 ms: 1.76x slower |
| crypto_pyaes | 53.2 ms | 52.6 ms: 1.01x faster | 45.1 ms: 1.18x faster |
| dask | 210 ms | 218 ms: 1.04x slower | 213 ms: 1.02x slower |
| deepcopy | 205 us | 257 us: 1.25x slower | 212 us: 1.03x slower |
| deepcopy_reduce | 1.87 us | 2.22 us: 1.19x slower | 1.91 us: 1.02x slower |
| deepcopy_memo | 22.3 us | 34.2 us: 1.54x slower | 23.2 us: 1.04x slower |
| deltablue | 2.07 ms | 3.08 ms: 1.49x slower | 2.09 ms: 1.01x slower |
| django_template | 20.9 ms | 24.2 ms: 1.16x slower | not significant |
| docutils | 1.35 sec | 1.46 sec: 1.09x slower | 1.44 sec: 1.07x slower |
| dulwich_log | 26.9 ms | 27.7 ms: 1.03x slower | 26.2 ms: 1.03x faster |
| fannkuch | 271 ms | 311 ms: 1.15x slower | 278 ms: 1.03x slower |
| float | 47.1 ms | 60.3 ms: 1.28x slower | 46.1 ms: 1.02x faster |
| create_gc_cycles | 659 us | 838 us: 1.27x slower | 837 us: 1.27x slower |
| gc_traversal | 2.41 ms | 2.40 ms: 1.00x faster | 2.42 ms: 1.01x slower |
| generators | 22.2 ms | 31.8 ms: 1.43x slower | 20.3 ms: 1.09x faster |
| genshi_text | 13.5 ms | 19.0 ms: 1.40x slower | 13.7 ms: 1.01x slower |
| genshi_xml | 28.0 ms | 38.0 ms: 1.36x slower | 29.7 ms: 1.06x slower |
| go | 85.4 ms | 124 ms: 1.45x slower | 98.3 ms: 1.15x slower |
| hexiom | 3.68 ms | 5.65 ms: 1.54x slower | 3.84 ms: 1.04x slower |
| html5lib | 29.5 ms | 34.3 ms: 1.16x slower | 30.6 ms: 1.04x slower |
| json_dumps | 6.22 ms | 6.38 ms: 1.03x slower | 6.13 ms: 1.01x faster |
| json_loads | 16.1 us | 14.9 us: 1.08x faster | 15.8 us: 1.01x faster |
| logging_format | 3.60 us | 4.05 us: 1.13x slower | 3.45 us: 1.04x faster |
| logging_silent | 57.0 ns | 76.2 ns: 1.34x slower | 59.9 ns: 1.05x slower |
| logging_simple | 3.31 us | 3.74 us: 1.13x slower | 3.16 us: 1.05x faster |
| mako | 6.63 ms | 8.72 ms: 1.32x slower | 6.50 ms: 1.02x faster |
| mdp | 1.39 sec | 1.50 sec: 1.08x slower | 1.50 sec: 1.08x slower |
| meteor_contest | 68.4 ms | 71.2 ms: 1.04x slower | 65.9 ms: 1.04x faster |
| nbody | 60.5 ms | 77.8 ms: 1.29x slower | 65.5 ms: 1.08x slower |
| nqueens | 56.5 ms | 72.4 ms: 1.28x slower | 53.1 ms: 1.06x faster |
| pathlib | 20.7 ms | 19.7 ms: 1.05x faster | 19.7 ms: 1.05x faster |
| pickle | 6.95 us | 7.50 us: 1.08x slower | 7.74 us: 1.11x slower |
| pickle_dict | 17.0 us | 18.8 us: 1.10x slower | 18.6 us: 1.09x slower |
| pickle_list | 2.67 us | 2.78 us: 1.04x slower | 2.74 us: 1.03x slower |
| pickle_pure_python | 170 us | 233 us: 1.37x slower | 177 us: 1.04x slower |
| pidigits | 252 ms | 259 ms: 1.03x slower | 257 ms: 1.02x slower |
| pprint_safe_repr | 457 ms | 576 ms: 1.26x slower | 469 ms: 1.03x slower |
| pprint_pformat | 930 ms | 1.18 sec: 1.27x slower | 956 ms: 1.03x slower |
| pyflate | 288 ms | 355 ms: 1.23x slower | 300 ms: 1.04x slower |
| python_startup | 13.4 ms | 14.0 ms: 1.05x slower | 15.6 ms: 1.17x slower |
| python_startup_no_site | 11.1 ms | 11.2 ms: 1.01x slower | 11.9 ms: 1.07x slower |
| raytrace | 190 ms | 201 ms: 1.06x slower | 154 ms: 1.24x faster |
| regex_compile | 71.1 ms | 85.8 ms: 1.21x slower | 65.9 ms: 1.08x faster |
| regex_dna | 135 ms | 136 ms: 1.01x slower | 131 ms: 1.03x faster |
| regex_effbot | 2.28 ms | 2.23 ms: 1.02x faster | 2.35 ms: 1.03x slower |
| regex_v8 | 15.0 ms | 15.7 ms: 1.05x slower | 15.4 ms: 1.02x slower |
| richards | 27.9 ms | 40.0 ms: 1.43x slower | 30.4 ms: 1.09x slower |
| richards_super | 31.3 ms | 44.1 ms: 1.41x slower | 33.0 ms: 1.06x slower |
| scimark_fft | 192 ms | 190 ms: 1.01x faster | 181 ms: 1.06x faster |
| scimark_lu | 64.8 ms | 84.0 ms: 1.30x slower | 66.4 ms: 1.03x slower |
| scimark_monte_carlo | 40.4 ms | 50.3 ms: 1.25x slower | 38.5 ms: 1.05x faster |
| scimark_sor | 75.3 ms | 110 ms: 1.46x slower | 93.5 ms: 1.24x slower |
| scimark_sparse_mat_mult | 3.03 ms | 2.63 ms: 1.15x faster | 2.76 ms: 1.10x faster |
| spectral_norm | 68.4 ms | 80.9 ms: 1.18x slower | 62.8 ms: 1.09x faster |
| sqlglot_normalize | 171 ms | 198 ms: 1.16x slower | 177 ms: 1.04x slower |
| sqlglot_optimize | 31.5 ms | 36.6 ms: 1.16x slower | 32.9 ms: 1.05x slower |
| sqlglot_parse | 740 us | 941 us: 1.27x slower | 731 us: 1.01x faster |
| sqlglot_transpile | 891 us | 1.12 ms: 1.26x slower | not significant |
| sqlite_synth | 1.59 us | 1.63 us: 1.02x slower | 1.68 us: 1.05x slower |
| sympy_expand | 231 ms | 251 ms: 1.09x slower | 239 ms: 1.03x slower |
| sympy_integrate | 10.1 ms | 11.3 ms: 1.12x slower | 10.3 ms: 1.02x slower |
| sympy_sum | 71.0 ms | 75.4 ms: 1.06x slower | 71.5 ms: 1.01x slower |
| sympy_str | 134 ms | 148 ms: 1.10x slower | 137 ms: 1.02x slower |
| telco | 3.55 ms | 4.56 ms: 1.28x slower | 4.61 ms: 1.30x slower |
| tomli_loads | 1.38 sec | 1.59 sec: 1.16x slower | 1.44 sec: 1.05x slower |
| typing_runtime_protocols | 97.7 us | 102 us: 1.04x slower | 98.5 us: 1.01x slower |
| unpack_sequence | 24.9 ns | 37.5 ns: 1.51x slower | not significant |
| unpickle | 9.12 us | 8.55 us: 1.07x faster | 9.02 us: 1.01x faster |
| unpickle_list | 3.06 us | 2.65 us: 1.16x faster | 2.83 us: 1.08x faster |
| unpickle_pure_python | 132 us | 183 us: 1.39x slower | 139 us: 1.06x slower |
| xml_etree_parse | 93.5 ms | 97.2 ms: 1.04x slower | 97.0 ms: 1.04x slower |
| xml_etree_iterparse | 68.8 ms | 72.9 ms: 1.06x slower | 67.6 ms: 1.02x faster |
| xml_etree_generate | 52.8 ms | 57.4 ms: 1.09x slower | 54.2 ms: 1.03x slower |
| xml_etree_process | 36.0 ms | 42.3 ms: 1.18x slower | 37.3 ms: 1.04x slower |
| Geometric mean | (ref) | 1.14x slower | 1.01x slower |
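The "Geometric mean" row summarizes per-benchmark time ratios against the reference build. The sketch below illustrates the idea on a handful of rows copied from the results above. It is not pyperformance's own implementation, and since it uses only a subset of benchmarks it will not reproduce the 1.01x figure; the names `SAMPLE_ROWS` and `overall_ratio` are illustrative.

```python
from statistics import geometric_mean

# (3.12.4 time, rebuilt 3.13.0rc1 time) in ms, copied from a few rows above.
SAMPLE_ROWS = {
    "2to3": (158, 163),
    "chameleon": (4.14, 4.28),
    "raytrace": (190, 154),
    "nbody": (60.5, 65.5),
}


def overall_ratio(rows):
    """Geometric mean of new/reference time ratios; above 1.0 means slower overall."""
    ratios = [new / ref for ref, new in rows.values()]
    return geometric_mean(ratios)


if __name__ == "__main__":
    print(f"geometric mean ratio: {overall_ratio(SAMPLE_ROWS):.3f}")
```

A geometric mean is used rather than an arithmetic mean so that a 2x slowdown and a 2x speedup cancel out to 1.0.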

For reference, here are our team's "official" benchmarks comparing 3.13.0rc0 vs. 3.12.0 on this platform last week.

@brandtbucher
Member

I'm seeing similar results on an x86_64 Mac (from 4% slower to 8% faster):

Benchmark 3.12.4 3.13.0rc1 (old) 3.13.0rc1 (new)
2to3 613 ms 345 ms: 1.78x faster 330 ms: 1.86x faster
async_generators 606 ms 472 ms: 1.29x faster 491 ms: 1.23x faster
async_tree_none 617 ms 460 ms: 1.34x faster 423 ms: 1.46x faster
async_tree_cpu_io_mixed 958 ms 761 ms: 1.26x faster 740 ms: 1.29x faster
async_tree_cpu_io_mixed_tg 919 ms 728 ms: 1.26x faster 705 ms: 1.30x faster
async_tree_eager 127 ms 132 ms: 1.04x slower 128 ms: 1.01x slower
async_tree_eager_cpu_io_mixed 513 ms 502 ms: 1.02x faster 502 ms: 1.02x faster
async_tree_eager_cpu_io_mixed_tg 432 ms 480 ms: 1.11x slower 416 ms: 1.04x faster
async_tree_eager_io 1.46 sec 1.41 sec: 1.04x faster 1.85 sec: 1.27x slower
async_tree_eager_io_tg 1.46 sec 1.35 sec: 1.09x faster 1.70 sec: 1.16x slower
async_tree_eager_memoization 331 ms 310 ms: 1.07x faster 354 ms: 1.07x slower
async_tree_eager_memoization_tg 246 ms 227 ms: 1.09x faster 258 ms: 1.05x slower
async_tree_eager_tg 95.7 ms 89.1 ms: 1.07x faster not significant
async_tree_io 1.40 sec 1.08 sec: 1.30x faster 1.22 sec: 1.15x faster
async_tree_io_tg 1.41 sec 1.11 sec: 1.27x faster 1.13 sec: 1.25x faster
async_tree_memoization 759 ms 582 ms: 1.30x faster 577 ms: 1.32x faster
async_tree_memoization_tg 764 ms 540 ms: 1.42x faster 540 ms: 1.42x faster
async_tree_none_tg 567 ms 400 ms: 1.42x faster 395 ms: 1.43x faster
asyncio_tcp 691 ms 655 ms: 1.05x faster 642 ms: 1.08x faster
asyncio_websockets 415 ms 422 ms: 1.02x slower 407 ms: 1.02x faster
chameleon 8.64 ms 8.85 ms: 1.03x slower 7.73 ms: 1.12x faster
chaos 90.0 ms 71.3 ms: 1.26x faster 67.5 ms: 1.33x faster
comprehensions 23.5 us 19.4 us: 1.21x faster 18.1 us: 1.30x faster
bench_mp_pool 108 ms 93.9 ms: 1.15x faster 85.7 ms: 1.26x faster
bench_thread_pool 1.35 ms 1.13 ms: 1.19x faster 1.08 ms: 1.24x faster
coroutines 33.2 ms 28.2 ms: 1.18x faster 26.7 ms: 1.24x faster
coverage 73.7 ms 105 ms: 1.42x slower 105 ms: 1.42x slower
crypto_pyaes 106 ms 80.2 ms: 1.33x faster 73.3 ms: 1.45x faster
dask 560 ms 490 ms: 1.14x faster 483 ms: 1.16x faster
deepcopy 472 us 420 us: 1.12x faster 394 us: 1.20x faster
deepcopy_reduce 3.87 us 3.72 us: 1.04x faster 3.61 us: 1.07x faster
deepcopy_memo 47.9 us 46.4 us: 1.03x faster 40.4 us: 1.19x faster
deltablue 4.14 ms 3.73 ms: 1.11x faster 3.32 ms: 1.25x faster
django_template 48.1 ms 45.0 ms: 1.07x faster 44.1 ms: 1.09x faster
docutils 3.34 sec 3.25 sec: 1.02x faster 3.10 sec: 1.08x faster
dulwich_log 104 ms 98.0 ms: 1.06x faster 95.0 ms: 1.09x faster
fannkuch 453 ms 490 ms: 1.08x slower 438 ms: 1.03x faster
float 92.6 ms 131 ms: 1.42x slower 83.7 ms: 1.11x faster
create_gc_cycles 1.18 ms 2.85 ms: 2.42x slower 1.43 ms: 1.21x slower
gc_traversal 3.78 ms 6.79 ms: 1.80x slower 3.99 ms: 1.06x slower
generators 39.3 ms 51.6 ms: 1.31x slower 34.7 ms: 1.13x faster
genshi_text 28.2 ms 35.3 ms: 1.25x slower 24.8 ms: 1.14x faster
genshi_xml 63.6 ms 86.8 ms: 1.36x slower 58.6 ms: 1.09x faster
go 157 ms 218 ms: 1.39x slower 148 ms: 1.06x faster
hexiom 7.35 ms 12.8 ms: 1.74x slower 6.70 ms: 1.10x faster
html5lib 82.7 ms 148 ms: 1.79x slower 75.7 ms: 1.09x faster
json_dumps 12.9 ms 19.6 ms: 1.52x slower 12.6 ms: 1.03x faster
json_loads 31.4 us 41.2 us: 1.31x slower 32.5 us: 1.03x slower
logging_format 9.34 us 13.7 us: 1.46x slower 8.96 us: 1.04x faster
logging_silent 110 ns 206 ns: 1.87x slower not significant
logging_simple 8.08 us 9.67 us: 1.20x slower 7.71 us: 1.05x faster
mako 13.4 ms 14.3 ms: 1.06x slower 12.4 ms: 1.08x faster
mdp 2.92 sec 3.34 sec: 1.14x slower 2.86 sec: 1.02x faster
meteor_contest 113 ms 125 ms: 1.11x slower 106 ms: 1.07x faster
nbody 111 ms 118 ms: 1.06x slower 92.2 ms: 1.21x faster
nqueens 105 ms 100 ms: 1.05x faster 94.2 ms: 1.12x faster
pathlib 48.5 ms 58.6 ms: 1.21x slower 43.2 ms: 1.12x faster
pickle 13.3 us 13.6 us: 1.02x slower 12.6 us: 1.06x faster
pickle_dict 33.5 us 40.3 us: 1.20x slower 32.1 us: 1.04x faster
pickle_list 5.30 us 6.14 us: 1.16x slower 5.17 us: 1.03x faster
pickle_pure_python 388 us 443 us: 1.14x slower 325 us: 1.19x faster
pidigits 194 ms 241 ms: 1.24x slower not significant
pprint_safe_repr 939 ms 1.18 sec: 1.26x slower 857 ms: 1.10x faster
pprint_pformat 1.89 sec 2.26 sec: 1.19x slower 1.94 sec: 1.03x slower
pyflate 586 ms 701 ms: 1.20x slower 612 ms: 1.05x slower
python_startup 29.9 ms 40.0 ms: 1.34x slower 26.8 ms: 1.12x faster
python_startup_no_site 26.6 ms 34.1 ms: 1.28x slower 22.5 ms: 1.18x faster
raytrace 396 ms 543 ms: 1.37x slower 300 ms: 1.32x faster
regex_compile 185 ms 174 ms: 1.06x faster 152 ms: 1.22x faster
regex_dna 183 ms 198 ms: 1.09x slower 190 ms: 1.04x slower
regex_effbot 3.95 ms 3.90 ms: 1.01x faster 3.77 ms: 1.05x faster
regex_v8 25.2 ms 26.9 ms: 1.07x slower 25.9 ms: 1.03x slower
richards 54.2 ms 56.7 ms: 1.05x slower 48.5 ms: 1.12x faster
richards_super 62.5 ms 65.8 ms: 1.05x slower 54.9 ms: 1.14x faster
scimark_fft 443 ms 395 ms: 1.12x faster 376 ms: 1.18x faster
scimark_lu 141 ms 136 ms: 1.04x faster 125 ms: 1.12x faster
scimark_monte_carlo 79.6 ms 71.9 ms: 1.11x faster 65.2 ms: 1.22x faster
scimark_sor 148 ms 143 ms: 1.03x faster 132 ms: 1.12x faster
scimark_sparse_mat_mult 6.04 ms 5.35 ms: 1.13x faster 5.60 ms: 1.08x faster
spectral_norm 142 ms 126 ms: 1.13x faster 130 ms: 1.09x faster
sqlglot_normalize 139 ms 145 ms: 1.04x slower 148 ms: 1.06x slower
sqlglot_optimize 69.3 ms 71.5 ms: 1.03x slower 73.4 ms: 1.06x slower
sqlglot_parse 1.64 ms not significant 1.70 ms: 1.04x slower
sqlglot_transpile 2.01 ms 2.19 ms: 1.09x slower not significant
sqlite_synth 3.04 us 3.12 us: 1.03x slower 3.31 us: 1.09x slower
sympy_expand 627 ms not significant 685 ms: 1.09x slower
sympy_integrate 24.1 ms 26.3 ms: 1.09x slower 29.3 ms: 1.22x slower
sympy_sum 205 ms 226 ms: 1.10x slower 230 ms: 1.12x slower
sympy_str 366 ms 392 ms: 1.07x slower 418 ms: 1.14x slower
telco 8.04 ms 10.3 ms: 1.28x slower 10.3 ms: 1.28x slower
tomli_loads 2.97 sec 3.59 sec: 1.21x slower 2.64 sec: 1.12x faster
tornado_http 166 ms 161 ms: 1.03x faster 151 ms: 1.10x faster
typing_runtime_protocols 210 us 184 us: 1.14x faster 189 us: 1.11x faster
unpack_sequence 46.7 ns 45.3 ns: 1.03x faster 38.0 ns: 1.23x faster
unpickle 18.1 us 16.7 us: 1.08x faster 17.7 us: 1.02x faster
unpickle_list 5.31 us 5.06 us: 1.05x faster 5.38 us: 1.01x slower
unpickle_pure_python 269 us 252 us: 1.07x faster 234 us: 1.15x faster
xml_etree_parse 177 ms not significant 175 ms: 1.01x faster
xml_etree_iterparse 134 ms 122 ms: 1.09x faster 119 ms: 1.12x faster
xml_etree_generate 109 ms not significant 105 ms: 1.03x faster
xml_etree_process 75.1 ms not significant 73.0 ms: 1.03x faster
Geometric mean (ref) 1.04x slower 1.08x faster
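
For anyone unfamiliar with the last row: a geometric mean is the n-th root of the product of the per-benchmark ratios, so one large outlier cannot dominate the summary. Below is a minimal sketch using a handful of ratios copied from the first comparison column. The helper name `time_ratio` and the choice of rows are illustrative assumptions, and because only a subset is used the result is not expected to match the 1.04x figure above.

```python
from statistics import geometric_mean


def time_ratio(factor: float, direction: str) -> float:
    """Convert '1.79x slower' / '1.05x faster' into new_time / reference_time."""
    if direction == "slower":
        return factor
    if direction == "faster":
        return 1.0 / factor
    raise ValueError(f"unknown direction: {direction!r}")


# A few rows from the first comparison column of the table above.
sample = {
    "html5lib": (1.79, "slower"),
    "json_dumps": (1.52, "slower"),
    "nqueens": (1.05, "faster"),
    "regex_compile": (1.06, "faster"),
    "scimark_fft": (1.12, "faster"),
}

ratios = [time_ratio(factor, direction) for factor, direction in sample.values()]
overall = geometric_mean(ratios)

# Report in the same "N.NNx slower/faster" style as the table.
label = "slower" if overall > 1 else "faster"
print(f"{max(overall, 1 / overall):.2f}x {label} (subset only)")
```

A ratio above 1 means the build took longer than the reference, so a "2x slower" and a "2x faster" result cancel out exactly under this mean.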

@hugovk
Member

hugovk commented Aug 8, 2024

See also #122832 for some extra benchmarking on the degradation.

@Zheaoli
Contributor

Zheaoli commented Aug 21, 2024

About ARM: do we have benchmarking for Linux aarch64 now?

@mdboom
Contributor

mdboom commented Aug 21, 2024

> About the ARM, do we have benchmarking for Linux aarch64 now?

Yes, thanks to a gracious donation from ARM, Inc. You can see them here

@mdboom
Contributor

mdboom commented Aug 26, 2024

I was also able to confirm the improvement on the official builds that @brandtbucher reported above. On that basis, if we needed to cut a release today, I think the solution is clear: go with the configuration we've measured to be an improvement.

However, it is surprising that turning off an optimization (LTO in this case) results in a speedup -- and I worry that by not understanding why, we just open ourselves up to other random changes going forward. As a first step, I thought I would try to reproduce the result on our own hardware / build environment. If I take the same 3.13.0rc1 commit and build it with:

  1. --enable-optimizations, --with-lto, --with-computed-gotos
  2. --enable-optimizations, --with-computed-gotos

(2) is about 7% slower than (1). This is the opposite of the effect we see with @ned-deily's official builds, but it's not surprising that turning off LTO would make things slower. (This is all with ARM64 as both host and target: I didn't cross-compile, and I don't have access to any Intel Mac hardware.)

It might make sense to break off the release blocker (how do we get a good build now) and create a new issue to investigate preventing this going forward.

@ned-deily
Member

Thanks for the additional data points, @mdboom. After a vacation break, I am back working on this and aim to have a resolution before next week's 3.13.0rc2.

@ned-deily ned-deily added build The build process and cross-build and removed type-bug An unexpected behavior, bug, or error release-blocker labels Sep 3, 2024
@ned-deily
Member

Extensive testing with the microbenchmarks provided in this issue and in #122832 confirms that (1) there was a significant performance degradation for at least some test cases with the 3.13.0rc1 provided by the python.org installer, particularly on Apple Silicon Macs, and (2) that degradation appears to be due to the use of the ThinLTO (Link Time Optimization) Python build option (--with-lto defaulting to --with-lto=thin) with recent Apple-supplied LLVM/Clang build tools. As noted earlier in this issue, the degradation was seen with very simple builds of the test case(s) when --with-lto was included. Further testing showed that the original monolithic LTO build option (--with-lto=full) provides a substantial performance boost for these two benchmarks. In all test configurations shown here, the three build configurations using --with-lto=full always provided the best performance with these microbenchmarks, in some cases dramatically better.

Since performance can vary greatly due to many factors, it is unreasonable to expect the same level of improvement in more extensive real-world benchmarks but it seems likely that an overall improvement of some magnitude may be seen with most benchmarks. From reading a bit about ThinLTO, it seems one of its biggest advantages was expected to be providing faster incremental builds rather than doing the monolithic optimization for each build. For the python.org installer case, that isn't an advantage since we always build the binary components of the installer from scratch for each release. And the documentation implies that, in most cases, the monolithic LTO would be expected to provide better run-time performance anyway. So all of that seems to indicate that full LTO is the right choice for the installer build in any case.

Moving forward, with the imminent release of 3.13.0rc2, we will use --with-lto=full to build the macOS installer for it and will run these microbenchmarks again before release. It would be great if more extensive benchmarks could be run with the released installer and results reported here to this issue. I will leave it open for that purpose for the immediate future.
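
For anyone who wants to time a script against several installed interpreters, here is a minimal sketch of a wall-clock harness. The function name `time_command`, the run count, and the one-line workload are illustrative assumptions; a real comparison would substitute the paths of the interpreters under test and the reproducer scripts from this issue and #122832.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run `cmd` `runs` times and return (mean, stdev) of wall-clock seconds."""
    if runs < 2:
        raise ValueError("need at least two runs to compute a standard deviation")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    # Placeholder: only the current interpreter is timed here. Add the paths
    # of the other interpreters you want to compare.
    interpreters = [sys.executable]
    for exe in interpreters:
        mean, stdev = time_command([exe, "-c", "sum(range(1_000_000))"])
        print(f"{exe}: {mean:.3f} ± {stdev:.3f} s")
```

Each run includes interpreter startup, so short workloads will mostly measure startup time; a longer-running script gives a more meaningful comparison.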

While there are still unanswered questions about exactly what it is about ThinLTO that causes the performance degradation in these environments, I am not planning to delve deeper into it since it's not something I have particular expertise in and it seems that using full LTO is a perfectly satisfactory solution for the macOS installer. Feel free to dive in yourself!

Thanks again to everyone who has helped out with reporting and investigating this!

Appended here is a summary of the more pertinent test results while investigating this issue:

Test Cases

Apple Silicon (MacBook Pro 14-inch, Nov 2023, M3, 16GB)

Summary

  test_122580 (version = 3.13.0rc1+-xcode) ran
    1.00 ± 0.02 times faster than test_122580 (version = 3.13.0rc1-lto-full)
    1.04 ± 0.02 times faster than test_122580 (version = 3.13.0rc1+-xcode-beta)
    1.11 ± 0.04 times faster than test_122580 (version = 3.13.0rc1-no-lto)
    1.16 ± 0.02 times faster than test_122580 (version = 3.12.5)
    1.17 ± 0.04 times faster than test_122580 (version = 3.13.0b4)
    1.18 ± 0.02 times faster than test_122580 (version = 3.12.4)
    1.20 ± 0.03 times faster than test_122580 (version = 3.13.0b1)
    1.61 ± 0.04 times faster than test_122580 (version = 3.13.0rc1)

  test_122832 (version = 3.13.0rc1+-xcode-beta) ran
    1.02 ± 0.02 times faster than test_122832 (version = 3.13.0rc1-lto-full)
    1.03 ± 0.01 times faster than test_122832 (version = 3.13.0rc1+-xcode)
    1.07 ± 0.01 times faster than test_122832 (version = 3.12.4)
    1.08 ± 0.01 times faster than test_122832 (version = 3.12.5)
    1.12 ± 0.03 times faster than test_122832 (version = 3.13.0b4)
    1.17 ± 0.01 times faster than test_122832 (version = 3.13.0rc1-no-lto)
    1.28 ± 0.02 times faster than test_122832 (version = 3.13.0b1)
    1.79 ± 0.02 times faster than test_122832 (version = 3.13.0rc1)

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| test_122580 (version = 3.12.4) | 4.851 ± 0.039 | 4.797 | 4.894 | 1.18 ± 0.02 |
| test_122580 (version = 3.12.5) | 4.778 ± 0.030 | 4.739 | 4.813 | 1.16 ± 0.02 |
| test_122580 (version = 3.13.0b1) | 4.954 ± 0.089 | 4.822 | 5.053 | 1.20 ± 0.03 |
| test_122580 (version = 3.13.0b4) | 4.840 ± 0.141 | 4.673 | 4.954 | 1.17 ± 0.04 |
| test_122580 (version = 3.13.0rc1) | 6.627 ± 0.138 | 6.403 | 6.751 | 1.61 ± 0.04 |
| test_122580 (version = 3.13.0rc1-no-lto) | 4.577 ± 0.142 | 4.385 | 4.740 | 1.11 ± 0.04 |
| test_122580 (version = 3.13.0rc1-lto-full) | 4.128 ± 0.076 | 4.032 | 4.225 | 1.00 ± 0.02 |
| test_122580 (version = 3.13.0rc1+-xcode) | 4.121 ± 0.066 | 4.006 | 4.169 | 1.00 |
| test_122580 (version = 3.13.0rc1+-xcode-beta) | 4.304 ± 0.073 | 4.199 | 4.372 | 1.04 ± 0.02 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| test_122832 (version = 3.12.4) | 16.645 ± 0.059 | 16.588 | 16.744 | 1.07 ± 0.01 |
| test_122832 (version = 3.12.5) | 16.843 ± 0.064 | 16.772 | 16.919 | 1.08 ± 0.01 |
| test_122832 (version = 3.13.0b1) | 19.956 ± 0.248 | 19.749 | 20.364 | 1.28 ± 0.02 |
| test_122832 (version = 3.13.0b4) | 17.402 ± 0.370 | 17.173 | 18.058 | 1.12 ± 0.03 |
| test_122832 (version = 3.13.0rc1) | 27.907 ± 0.206 | 27.669 | 28.202 | 1.79 ± 0.02 |
| test_122832 (version = 3.13.0rc1-no-lto) | 18.313 ± 0.161 | 18.128 | 18.558 | 1.17 ± 0.01 |
| test_122832 (version = 3.13.0rc1-lto-full) | 15.988 ± 0.195 | 15.834 | 16.294 | 1.02 ± 0.02 |
| test_122832 (version = 3.13.0rc1+-xcode) | 16.057 ± 0.075 | 15.960 | 16.160 | 1.03 ± 0.01 |
| test_122832 (version = 3.13.0rc1+-xcode-beta) | 15.599 ± 0.139 | 15.355 | 15.692 | 1.00 |
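
The "Relative" column and the "times faster than" lines in the summaries are each mean divided by the fastest mean. A minimal sketch that recomputes them from the Apple Silicon test_122580 means above; the function name `relative_to_fastest` is an illustrative assumption, and the ± error propagation shown in the tables is deliberately left out.

```python
# Mean run times in seconds, copied from the Apple Silicon test_122580 table.
means = {
    "3.12.4": 4.851,
    "3.12.5": 4.778,
    "3.13.0b1": 4.954,
    "3.13.0b4": 4.840,
    "3.13.0rc1": 6.627,
    "3.13.0rc1-no-lto": 4.577,
    "3.13.0rc1-lto-full": 4.128,
    "3.13.0rc1+-xcode": 4.121,
    "3.13.0rc1+-xcode-beta": 4.304,
}


def relative_to_fastest(means):
    """Return each mean divided by the smallest mean (the fastest build is 1.0)."""
    best = min(means.values())
    return {name: mean / best for name, mean in means.items()}


rel = relative_to_fastest(means)
# Print from fastest to slowest, like the summary block.
for name, ratio in sorted(rel.items(), key=lambda item: item[1]):
    print(f"{ratio:.2f}  {name}")
```

Read this way, the ThinLTO-built 3.13.0rc1 installer took roughly 1.6 times as long as the full-LTO builds on this machine.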

Intel (iMac Retina 5K 27-inch, 2019, 3.6 GHz 8-Core Intel Core i9, 40GB)

Summary

  test_122580 (version = 3.13.0rc1+-xcode-beta) ran
    1.08 ± 0.03 times faster than test_122580 (version = 3.13.0rc1+-xcode)
    1.08 ± 0.02 times faster than test_122580 (version = 3.13.0rc1-lto-full)
    1.19 ± 0.02 times faster than test_122580 (version = 3.13.0rc1-no-lto)
    1.20 ± 0.02 times faster than test_122580 (version = 3.13.0b4)
    1.24 ± 0.02 times faster than test_122580 (version = 3.13.0rc1)
    1.43 ± 0.08 times faster than test_122580 (version = 3.13.0b1)
    1.46 ± 0.04 times faster than test_122580 (version = 3.12.4)
    1.47 ± 0.03 times faster than test_122580 (version = 3.12.5)

  test_122832 (version = 3.13.0rc1+-xcode-beta) ran
    1.08 ± 0.04 times faster than test_122832 (version = 3.13.0rc1+-xcode)
    1.11 ± 0.02 times faster than test_122832 (version = 3.13.0rc1-lto-full)
    1.13 ± 0.01 times faster than test_122832 (version = 3.12.5)
    1.13 ± 0.01 times faster than test_122832 (version = 3.12.4)
    1.16 ± 0.02 times faster than test_122832 (version = 3.13.0rc1-no-lto)
    1.20 ± 0.02 times faster than test_122832 (version = 3.13.0b4)
    1.26 ± 0.05 times faster than test_122832 (version = 3.13.0rc1)
    1.35 ± 0.02 times faster than test_122832 (version = 3.13.0b1)

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| test_122580 (version = 3.12.4) | 10.535 ± 0.193 | 10.293 | 10.765 | 1.46 ± 0.04 |
| test_122580 (version = 3.12.5) | 10.630 ± 0.088 | 10.540 | 10.753 | 1.47 ± 0.03 |
| test_122580 (version = 3.13.0b1) | 10.328 ± 0.563 | 9.811 | 10.947 | 1.43 ± 0.08 |
| test_122580 (version = 3.13.0b4) | 8.681 ± 0.093 | 8.606 | 8.832 | 1.20 ± 0.02 |
| test_122580 (version = 3.13.0rc1) | 8.943 ± 0.060 | 8.867 | 9.013 | 1.24 ± 0.02 |
| test_122580 (version = 3.13.0rc1-no-lto) | 8.576 ± 0.055 | 8.497 | 8.631 | 1.19 ± 0.02 |
| test_122580 (version = 3.13.0rc1-lto-full) | 7.777 ± 0.130 | 7.613 | 7.905 | 1.08 ± 0.02 |
| test_122580 (version = 3.13.0rc1+-xcode) | 7.772 ± 0.136 | 7.655 | 8.003 | 1.08 ± 0.03 |
| test_122580 (version = 3.13.0rc1+-xcode-beta) | 7.210 ± 0.113 | 7.088 | 7.359 | 1.00 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| test_122832 (version = 3.12.4) | 33.019 ± 0.182 | 32.798 | 33.217 | 1.13 ± 0.01 |
| test_122832 (version = 3.12.5) | 32.950 ± 0.186 | 32.651 | 33.158 | 1.13 ± 0.01 |
| test_122832 (version = 3.13.0b1) | 39.386 ± 0.470 | 38.945 | 39.939 | 1.35 ± 0.02 |
| test_122832 (version = 3.13.0b4) | 34.851 ± 0.509 | 34.294 | 35.525 | 1.20 ± 0.02 |
| test_122832 (version = 3.13.0rc1) | 36.653 ± 1.359 | 35.594 | 39.035 | 1.26 ± 0.05 |
| test_122832 (version = 3.13.0rc1-no-lto) | 33.851 ± 0.478 | 33.435 | 34.554 | 1.16 ± 0.02 |
| test_122832 (version = 3.13.0rc1-lto-full) | 32.255 ± 0.505 | 31.864 | 33.100 | 1.11 ± 0.02 |
| test_122832 (version = 3.13.0rc1+-xcode) | 31.530 ± 1.057 | 30.654 | 32.988 | 1.08 ± 0.04 |
| test_122832 (version = 3.13.0rc1+-xcode-beta) | 29.104 ± 0.263 | 28.850 | 29.499 | 1.00 |

Build configurations

| Python Version | Apple Clang | Build System | Deployment Target | --with-lto= |
|---|---|---|---|---|
| 3.12.4 | clang-1300.0.29.30 | x86_64-apple-darwin20.6.0 | 10.9 | thin |
| 3.12.5 | clang-1300.0.29.30 | x86_64-apple-darwin20.6.0 | 10.9 | thin |
| 3.13.0b1 | clang-1300.0.29.30 | x86_64-apple-darwin20.6.0 | 10.9 | thin |
| 3.13.0b4 | clang-1500.3.9.4 | aarch64-apple-darwin23.5.0 | 10.13 | thin |
| 3.13.0rc1 | clang-1500.3.9.4 | aarch64-apple-darwin23.5.0 | 10.13 | thin |
| 3.13.0rc1-no-lto | clang-1500.3.9.4 | aarch64-apple-darwin23.6.0 | 10.13 | no |
| 3.13.0rc1-lto-full | clang-1500.3.9.4 | aarch64-apple-darwin23.6.0 | 10.13 | full |
| 3.13.0rc1+-xcode | clang-1500.3.9.4 | aarch64-apple-darwin23.6.0 | 10.13 | full |
| 3.13.0rc1+-xcode-beta | clang beta | aarch64-apple-darwin23.6.0 | 10.13 | full |

@ned-deily ned-deily moved this from Todo to Done in Release and Deferred blockers 🚫 Sep 3, 2024
@mdboom
Contributor

mdboom commented Sep 3, 2024

Thanks for all your hard work on this, @ned-deily. This finding seems reasonable. I've put the rc2 release date on my calendar and will run it against the full pyperformance suite then to confirm.

@brandtbucher
Member

Thanks @ned-deily. I wonder, should the default for --with-lto be full instead of thin? I just assumed that was already the case (and I'm not sure what our benchmarking infrastructure assumes).

In general, if I'm passing --with-lto, I'm expecting the build to take a while (and be as fast as possible in return).

@itamaro
Contributor

itamaro commented Sep 3, 2024

What's the default LTO (full or thin) for the Linux builds?

@mdboom
Contributor

mdboom commented Sep 3, 2024

IIUC, thin LTO is only supported on llvm/clang, so our gcc-based Linux builds don't use it.

@ned-deily
Member

The default for --with-lto is platform-dependent and compiler-dependent. configure.ac has a bunch of code to deal with it (see, for example, starting here). There has also been a fair amount of discussion about lto options (for example, here and here). If somebody wants to revisit that, it would be better to do it in another issue; I'd like to keep this focused on the macOS installer.
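
Since the effective setting depends on how a given interpreter was configured, one way to see what a particular build requested is to inspect the recorded configure arguments. This is a minimal sketch: the helper name `lto_mode` and its return labels are illustrative assumptions, and it only reports the flag passed to configure. A bare `--with-lto` is reported as "default" because, as noted above, what that resolves to depends on platform and compiler. `CONFIG_ARGS` may be unavailable on builds that do not use configure (for example on Windows), in which case this prints "none".

```python
import shlex
import sysconfig


def lto_mode(config_args: str) -> str:
    """Return 'none', 'default', or the explicit --with-lto= value.

    The last LTO-related flag wins, mirroring how repeated configure
    options override earlier ones.
    """
    mode = "none"
    for arg in shlex.split(config_args):
        if arg == "--with-lto":
            mode = "default"
        elif arg.startswith("--with-lto="):
            mode = arg.split("=", 1)[1]
        elif arg == "--without-lto":
            mode = "none"
    return mode


# CONFIG_ARGS holds the arguments configure was run with, when available.
config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
print("LTO requested at configure time:", lto_mode(config_args))
```

Run under each interpreter being compared, this makes it easier to tell whether two builds differ in their LTO setting before attributing a timing difference to it.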

@ned-deily ned-deily changed the title Python3.13 Performance Issue on MacOS ARM Python3.13 performance Issue with python.org macOS installers on ARM Macs Sep 3, 2024
@rhettinger
Contributor

Just tested 3.13rc2 and the performance issues are solved. Thank you.

@mdboom
Contributor

mdboom commented Sep 9, 2024

> Just tested 3.13rc2 and the performance issues are solved. Thank you.

Same (on our usual macOS lab machine).

@itamaro
Contributor

itamaro commented Sep 9, 2024

closing as fixed, thanks for testing @rhettinger & @mdboom !
please re-open if there's anything left to do here.

@ned-deily
Member

FWIW, 3.13.0rc3 differences noted: #124567 (comment)

Labels
3.13 bugs and security fixes build The build process and cross-build OS-mac performance Performance or resource usage
10 participants