
Statically link libpython into interpreter (but keep building libpython3.x.so) #592

Open: geofft wants to merge 2 commits into main from interp-static-libpython

Conversation

geofft (Collaborator) commented Apr 19, 2025

Cherry-pick python/cpython#133313 and apply it. The general motivation is performance; see the patch commit message for more details.
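
For a quick sanity check of the intended result on an extracted distribution (a sketch; paths are illustrative):

# the interpreter should no longer depend on libpython3.x.so...
ldd python/install/bin/python3 | grep libpython || echo "libpython is statically linked"
# ...but libpython3.x.so should still be built and shipped
ls python/install/lib/libpython3.*.so*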

geofft added the platform:linux label on Apr 19, 2025
geofft force-pushed the interp-static-libpython branch 4 times, most recently from d8838a7 to 2af08c0 on April 28, 2025 21:13
geofft (Collaborator, Author) commented May 2, 2025

No obvious performance impact in any direction, really.

In the following:

  • directory b contains cpython-3.13.3-x86_64-unknown-linux-gnu-pgo+lto from the previous workflow run, where I incorrectly applied the patch after autoconf so it didn't do anything (which makes it a perfect baseline for comparison); I think this workflow run specifically
  • directory c contains cpython-3.13.3-x86_64-unknown-linux-gnu-pgo+lto from the current run
  • and then I ran cp -a c new-build-with-old-binary; cp {b,new-build-with-old-binary}/python/install/bin/python3.13 to get a Python executable that uses libpython.so; i.e., in the new-build-with-old-binary directory we are loading an unBOLTed libpython into the interpreter, just to see what libpython's performance is like (the copy step is written out after this list).
  • pystone.py is the last version from before it was removed from cpython sources (git show 61fd70e05027150b21184c7bc9fa8aa0a49f9601^:Lib/test/pystone.py)
  • test machine is an AWS t2.medium running the Ubuntu 24.04 AMI, recently rebooted / not running anything else
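
Written out, the copy step from the third bullet is just (brace expansion expanded; a sketch):

cp -a c new-build-with-old-binary
# old, dynamically linked interpreter binary on top of the otherwise-new build
cp b/python/install/bin/python3.13 new-build-with-old-binary/python/install/bin/python3.13
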
ubuntu@ip-172-16-0-59:~$ hyperfine -L dir b,c,new-build-with-old-binary '{dir}/python/install/bin/python3 -c "import ssl"'
Benchmark 1: b/python/install/bin/python3 -c "import ssl"
  Time (mean ± σ):      35.2 ms ±   1.3 ms    [User: 27.7 ms, System: 7.3 ms]
  Range (min … max):    33.1 ms …  41.9 ms    70 runs
 
Benchmark 2: c/python/install/bin/python3 -c "import ssl"
  Time (mean ± σ):      35.2 ms ±   0.9 ms    [User: 27.1 ms, System: 8.0 ms]
  Range (min … max):    33.2 ms …  37.8 ms    82 runs
 
Benchmark 3: new-build-with-old-binary/python/install/bin/python3 -c "import ssl"
  Time (mean ± σ):      35.5 ms ±   1.0 ms    [User: 28.4 ms, System: 6.9 ms]
  Range (min … max):    33.6 ms …  38.7 ms    84 runs
 
Summary
  b/python/install/bin/python3 -c "import ssl" ran
    1.00 ± 0.04 times faster than c/python/install/bin/python3 -c "import ssl"
    1.01 ± 0.05 times faster than new-build-with-old-binary/python/install/bin/python3 -c "import ssl"
ubuntu@ip-172-16-0-59:~$ hyperfine -L dir b,c,new-build-with-old-binary '{dir}/python/install/bin/python3 -c "import ssl"'
Benchmark 1: b/python/install/bin/python3 -c "import ssl"
  Time (mean ± σ):      35.6 ms ±   0.8 ms    [User: 28.4 ms, System: 7.0 ms]
  Range (min … max):    33.8 ms …  37.6 ms    81 runs
 
Benchmark 2: c/python/install/bin/python3 -c "import ssl"
  Time (mean ± σ):      35.5 ms ±   0.9 ms    [User: 27.5 ms, System: 7.8 ms]
  Range (min … max):    33.5 ms …  37.8 ms    82 runs
 
Benchmark 3: new-build-with-old-binary/python/install/bin/python3 -c "import ssl"
  Time (mean ± σ):      36.1 ms ±   0.9 ms    [User: 28.5 ms, System: 7.3 ms]
  Range (min … max):    34.1 ms …  38.3 ms    81 runs
 
Summary
  c/python/install/bin/python3 -c "import ssl" ran
    1.00 ± 0.03 times faster than b/python/install/bin/python3 -c "import ssl"
    1.02 ± 0.04 times faster than new-build-with-old-binary/python/install/bin/python3 -c "import ssl"
ubuntu@ip-172-16-0-59:~$ hyperfine -L dir b,c,new-build-with-old-binary '{dir}/python/install/bin/python3 pystone.py'
Benchmark 1: b/python/install/bin/python3 pystone.py
  Time (mean ± σ):     120.3 ms ±   2.3 ms    [User: 114.0 ms, System: 6.0 ms]
  Range (min … max):   116.4 ms … 125.9 ms    25 runs
 
Benchmark 2: c/python/install/bin/python3 pystone.py
  Time (mean ± σ):     121.2 ms ±   1.3 ms    [User: 113.8 ms, System: 7.1 ms]
  Range (min … max):   118.7 ms … 124.5 ms    24 runs
 
Benchmark 3: new-build-with-old-binary/python/install/bin/python3 pystone.py
  Time (mean ± σ):     118.6 ms ±   1.3 ms    [User: 112.8 ms, System: 5.6 ms]
  Range (min … max):   116.5 ms … 120.9 ms    24 runs
 
Summary
  new-build-with-old-binary/python/install/bin/python3 pystone.py ran
    1.01 ± 0.02 times faster than b/python/install/bin/python3 pystone.py
    1.02 ± 0.02 times faster than c/python/install/bin/python3 pystone.py

geofft (Collaborator, Author) commented May 2, 2025

... which is a little surprising, come to think of it, given that the primary motivation for this was performance.

Just to confirm that the build change is actually doing what we expect:

ubuntu@ip-172-16-0-59:~$ ldd b/python/install/bin/python3
	linux-vdso.so.1 (0x00007ffcd91ba000)
	/home/ubuntu/b/python/install/bin/../lib/libpython3.13.so.1.0 (0x00007a40ed400000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007a40eebd9000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007a40eebd4000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007a40eebcf000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007a40ed317000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007a40eebc8000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007a40ed000000)
	/lib64/ld-linux-x86-64.so.2 (0x00007a40eebe6000)
ubuntu@ip-172-16-0-59:~$ ldd c/python/install/bin/python3
	linux-vdso.so.1 (0x00007ffc61bc9000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ac98f79b000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ac98f796000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007ac98f791000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ac98f6a8000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007ac98f6a3000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ac98f400000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ac98f7a8000)
ubuntu@ip-172-16-0-59:~$ ldd new-build-with-old-binary/python/install/bin/python3
	linux-vdso.so.1 (0x00007ffc041ea000)
	/home/ubuntu/new-build-with-old-binary/python/install/bin/../lib/libpython3.13.so.1.0 (0x000079fe10e00000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x000079fe12767000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000079fe12762000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x000079fe1275d000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000079fe12674000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x000079fe1266d000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000079fe10a00000)
	/lib64/ld-linux-x86-64.so.2 (0x000079fe12774000)

I wonder if the problem is that my AWS VM has too fast a filesystem! If I drop caches before each benchmark run, then we see a more defensible couple-percent speedup from this change:

ubuntu@ip-172-16-0-59:~$ hyperfine -L dir b,c,new-build-with-old-binary '{dir}/python/install/bin/python3 pystone.py' --prepare 'echo 3 | sudo tee /proc/sys/vm/drop_caches'
Benchmark 1: b/python/install/bin/python3 pystone.py
  Time (mean ± σ):     285.4 ms ±   4.7 ms    [User: 117.8 ms, System: 15.7 ms]
  Range (min … max):   277.8 ms … 293.3 ms    10 runs
 
Benchmark 2: c/python/install/bin/python3 pystone.py
  Time (mean ± σ):     264.5 ms ±   5.2 ms    [User: 115.0 ms, System: 17.1 ms]
  Range (min … max):   256.3 ms … 269.2 ms    10 runs
 
Benchmark 3: new-build-with-old-binary/python/install/bin/python3 pystone.py
  Time (mean ± σ):     280.9 ms ±   7.3 ms    [User: 115.4 ms, System: 15.6 ms]
  Range (min … max):   269.4 ms … 289.6 ms    10 runs
 
Summary
  c/python/install/bin/python3 pystone.py ran
    1.06 ± 0.03 times faster than new-build-with-old-binary/python/install/bin/python3 pystone.py
    1.08 ± 0.03 times faster than b/python/install/bin/python3 pystone.py

geofft (Collaborator, Author) commented May 2, 2025

We'd also expect to see some speedup on thread-local storage accesses. I can confirm that we are getting the faster TLS model, but I can't actually measure a difference.

I've tried doing more threading with this silly benchmark (saved as ~/threadstone.py, used below), but only got a 1-2% performance difference.

import pystone
import concurrent.futures

# Run pystone concurrently on a thread pool to exercise thread-local storage.
t = concurrent.futures.ThreadPoolExecutor()
for i in range(500):
    t.submit(pystone.pystones, 1000)
t.shutdown()

However, on the free-threaded variants (cpython-3.13-x86_64-unknown-linux-gnu-freethreaded+pgo+lto), there is a noticeable performance difference:

ubuntu@ip-172-16-0-59:~/ft$ hyperfine -L dir b,c,new-build-with-old-binary '{dir}/python/install/bin/python3 ~/pystone.py'
Benchmark 1: b/python/install/bin/python3 ~/pystone.py
  Time (mean ± σ):     252.2 ms ±   2.5 ms    [User: 243.9 ms, System: 7.9 ms]
  Range (min … max):   248.9 ms … 255.8 ms    12 runs
 
Benchmark 2: c/python/install/bin/python3 ~/pystone.py
  Time (mean ± σ):     227.8 ms ±   1.7 ms    [User: 218.9 ms, System: 8.6 ms]
  Range (min … max):   224.0 ms … 230.0 ms    13 runs
 
Benchmark 3: new-build-with-old-binary/python/install/bin/python3 ~/pystone.py
  Time (mean ± σ):     241.7 ms ±   3.0 ms    [User: 233.0 ms, System: 8.4 ms]
  Range (min … max):   237.1 ms … 248.4 ms    12 runs
 
Summary
  c/python/install/bin/python3 ~/pystone.py ran
    1.06 ± 0.02 times faster than new-build-with-old-binary/python/install/bin/python3 ~/pystone.py
    1.11 ± 0.01 times faster than b/python/install/bin/python3 ~/pystone.py
ubuntu@ip-172-16-0-59:~/ft$ hyperfine -L dir b,c,new-build-with-old-binary '{dir}/python/install/bin/python3 ~/threadstone.py'
Benchmark 1: b/python/install/bin/python3 ~/threadstone.py
  Time (mean ± σ):      2.194 s ±  0.055 s    [User: 3.725 s, System: 0.548 s]
  Range (min … max):    2.111 s …  2.269 s    10 runs
 
Benchmark 2: c/python/install/bin/python3 ~/threadstone.py
  Time (mean ± σ):      1.992 s ±  0.037 s    [User: 3.396 s, System: 0.480 s]
  Range (min … max):    1.924 s …  2.037 s    10 runs
 
Benchmark 3: new-build-with-old-binary/python/install/bin/python3 ~/threadstone.py
  Time (mean ± σ):      2.109 s ±  0.037 s    [User: 3.571 s, System: 0.535 s]
  Range (min … max):    2.051 s …  2.167 s    10 runs
 
Summary
  c/python/install/bin/python3 ~/threadstone.py ran
    1.06 ± 0.03 times faster than new-build-with-old-binary/python/install/bin/python3 ~/threadstone.py
    1.10 ± 0.03 times faster than b/python/install/bin/python3 ~/threadstone.py

So I think the change is defensible on those grounds.

(And the unBOLTed libpython isn't slower; it's actually apparently faster, which is maybe concerning...)

Details of thread-local storage code generation differences:

From objdump -dr python/install/lib/libpython3.13.so.1.0, somewhere in the function _PyPegen_new_identifier:

  1ac233:	e8 98 be ff ff       	call   1a80d0 <PyUnicode_DecodeUTF8@plt>
			1ac234: R_X86_64_PLT32	PyUnicode_DecodeUTF8-0x4
  1ac238:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
  1ac23d:	48 85 c0             	test   %rax,%rax
  1ac240:	0f 84 79 01 00 00    	je     1ac3bf <_PyPegen_new_identifier+0x1af>
  1ac246:	f6 40 20 40          	testb  $0x40,0x20(%rax)
  1ac24a:	74 4c                	je     1ac298 <_PyPegen_new_identifier+0x88>
  1ac24c:	48 8d 3d 6d 71 4c 01 	lea    0x14c716d(%rip),%rdi        # 16733c0 <.got>
			1ac24f: R_X86_64_TLSLD	_Py_tss_tstate-0x4
  1ac253:	e8 08 eb ff ff       	call   1aad60 <__tls_get_addr@plt>
			1ac254: R_X86_64_PLT32	__tls_get_addr@GLIBC_2.3-0x4
  1ac258:	48 8b 80 18 00 00 00 	mov    0x18(%rax),%rax
			1ac25b: R_X86_64_DTPOFF32	_Py_tss_tstate
  1ac25f:	48 8b 78 10          	mov    0x10(%rax),%rdi
  1ac263:	4c 8d 74 24 08       	lea    0x8(%rsp),%r14
  1ac268:	4c 89 f6             	mov    %r14,%rsi
  1ac26b:	e8 d0 d2 ff ff       	call   1a9540 <_PyUnicode_InternImmortal@plt>
			1ac26c: R_X86_64_PLT32	_PyUnicode_InternImmortal-0x4
  1ac270:	48 8b 7b 20          	mov    0x20(%rbx),%rdi
  1ac274:	4d 8b 36             	mov    (%r14),%r14
  1ac277:	4c 89 f6             	mov    %r14,%rsi
  1ac27a:	e8 31 bb ff ff       	call   1a7db0 <_PyArena_AddPyObject@plt>
			1ac27b: R_X86_64_PLT32	_PyArena_AddPyObject-0x4

From objdump -dr python/install/bin/python3, same code:

 1c7bdb3:	e8 9e 39 dc ff       	call   1a3f756 <PyUnicode_DecodeUTF8>
 1c7bdb8:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
 1c7bdbd:	48 85 c0             	test   %rax,%rax
 1c7bdc0:	0f 84 d6 d0 2c 00    	je     1f48e9c <_PyPegen_new_identifier.cold+0x108>
 1c7bdc6:	f6 40 20 40          	testb  $0x40,0x20(%rax)
 1c7bdca:	0f 84 c4 cf 2c 00    	je     1f48d94 <_PyPegen_new_identifier.cold>
 1c7bdd0:	64 48 8b 04 25 f8 ff 	mov    %fs:0xfffffffffffffff8,%rax
 1c7bdd7:	ff ff 
 1c7bdd9:	48 8b 78 10          	mov    0x10(%rax),%rdi
 1c7bddd:	4c 8d 74 24 08       	lea    0x8(%rsp),%r14
 1c7bde2:	4c 89 f6             	mov    %r14,%rsi
 1c7bde5:	e8 e0 32 dc ff       	call   1a3f0ca <_PyUnicode_InternImmortal>
 1c7bdea:	48 8b 7b 20          	mov    0x20(%rbx),%rdi
 1c7bdee:	4d 8b 36             	mov    (%r14),%r14
 1c7bdf1:	4c 89 f6             	mov    %r14,%rsi
 1c7bdf4:	e8 e3 1a dc ff       	call   1a3d8dc <_PyArena_AddPyObject>

Note the function call to __tls_get_addr in the shared library and the direct reference to 8 bytes below %fs in the binary.
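
A quick way to double-check the TLS model without reading the disassembly (a sketch, run from inside one of the extracted distributions; counts are approximate):

# general/local-dynamic TLS in the shared library goes through __tls_get_addr...
objdump -d python/install/lib/libpython3.13.so.1.0 | grep -c '__tls_get_addr'
# ...while the statically linked interpreter uses %fs-relative accesses, so this should be (close to) zero
objdump -d python/install/bin/python3 | grep -c '__tls_get_addr'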

zanieb (Member) commented May 7, 2025

The riscv failures look like:

validating dist/cpython-3.13.3-riscv64-unknown-linux-gnu-debug-20250506T2015.tar.zst
Error: errors found
  error: python/install/bin/python3.13d library load of libatomic.so.1 does not have system link build annotation
  error: python/install/lib/libpython3.13d.so.1.0 library load of libatomic.so.1 does not have system link build annotation

# and some of the defaults are wrong or unwanted, so we have to fix
# those manually.
# TODO(geofft): This cannot be the only thing.
patch -p1 -i "${ROOT}/patch-python-configure-cross-assume-pthread.patch"

zanieb (Member) commented on this hunk:

Can we pass an option to configure that sets this to yes instead of patching?

zanieb (Member):

(Should this be limited to cross-compiles?)

geofft (Collaborator, Author):

> (Should this be limited to cross-compiles?)

The patch as written is limited to cross-compiles, in that it changes the default behavior when ./configure is running in cross-compile mode and therefore (thinks that) it cannot run test programs. It leaves the configure test in place with the current behavior for the non-cross-compiling case. See #599.

> Can we pass an option to configure that sets this to yes instead of patching?

Looks like yes, with the test ./configure script mentioned in #599:

$ ./configure cross_compiling=yes ac_cv_pthread=yes
[...]
checking whether gcc accepts -pthread... (cached) yes

My vague intuition is that a patch is the right thing here because this patch is actually suitable for upstream—it is a better guess in 2025 that a compiler supports -pthread than that it doesn't (just as it's a better guess that a compiler supports C99 strftime, etc.). But I don't have a strong opinion. If you prefer, yes, we can do

if [ -n "${CROSS_COMPILING}" ]; then
  EXTRA_CONFIGURE_FLAGS="${EXTRA_CONFIGURE_FLAGS} ac_cv_pthread=yes"
fi

(As mentioned in #599 we can also pass cross_compiling=yes on the command line and drop the existing patch-force-cross-compile.patch.)

zanieb (Member):

I do think

if [ -n "${CROSS_COMPILING}" ]; then
  EXTRA_CONFIGURE_FLAGS="${EXTRA_CONFIGURE_FLAGS} ac_cv_pthread=yes"
fi

is generally better than a patch, since it removes a layer of indirection and is a common ./configure pattern. It also makes it much more obvious that it's a cross-compile-only change. I'm also supportive of you trying to contribute the patch upstream. I don't feel strongly here, but I was surprised/confused.

geofft (Collaborator, Author) commented May 7, 2025

Yeah, I think this needs a small tweak to the validation script; let me do that. I'm going to tag this PR as riscv-only to avoid rerunning CI on the successful arches; for future reference, here's the current Actions run with those results.

geofft added 2 commits May 7, 2025 17:35
Also, switch to using cross_compiling=yes instead of patching
./configure in place, which allows us to move rerunning autoconf to
right before running ./configure, avoiding the risk of patching
./configure.ac too late.

See astral-sh#599.
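
A rough sketch of the ordering this describes (illustrative only, not the literal build script; the patch filename below is a placeholder):

# apply all configure.ac-level patches first
patch -p1 -i "${ROOT}/patch-some-configure-ac-change.patch"
# rerun autoconf only after every configure.ac patch has landed, so nothing is patched too late
autoconf
# tell the generated ./configure it is cross-compiling via a variable instead of patching it in place
./configure cross_compiling=yes ...
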
geofft force-pushed the interp-static-libpython branch from f073cef to 317d430 on May 7, 2025 21:41
zanieb (Member) commented May 7, 2025

Nice! Those last few fixes look good to me.

Labels: arch:riscv64, arch:x86_64, platform:linux