Statically link libpython into interpreter (but keep building libpython3.x.so) #592
base: main
Conversation
(Force-pushed from a5b9b77 to d8838a7, then from d8838a7 to 2af08c0.)
No obvious performance impact in any direction, really. In the following:
... which is a little surprising, come to think of it, given that the primary motivation for this was performance. Just to confirm, the build change is actually doing what we expect:
I wonder if the problem is that my AWS VM's filesystem is too fast! If I drop caches before each benchmark run, then we see a more defensible couple-percent speedup from this change:
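For anyone reproducing this, dropping the Linux page cache between runs can be sketched as follows (Linux-only, requires root; the exact commands used for the numbers above aren't shown in the thread):

```python
import os

def drop_caches():
    """Drop the Linux page cache, dentries, and inodes (needs root).

    Equivalent to: sync && echo 3 > /proc/sys/vm/drop_caches
    """
    os.sync()  # flush dirty pages first so clean pages can be reclaimed
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")  # 1 = pagecache, 2 = dentries/inodes, 3 = both
```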
We'd also expect to see some speedup on thread-local storage. I can confirm that we are getting the faster TLS model, but I can't actually measure a difference. I've tried doing more threading with this silly benchmark, but only got a 1-2% performance difference:

```python
import pystone
import concurrent.futures

t = concurrent.futures.ThreadPoolExecutor()
for i in range(500):
    t.submit(pystone.pystones, 1000)
t.shutdown()
```

However, on the free-threaded variants (
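To time a run of that shape end to end, the pool can be wrapped in a simple timer; here is a sketch using a hypothetical stand-in workload in place of `pystone.pystones` (pystone isn't importable on stock Python 3):

```python
import concurrent.futures
import time

def workload(n):
    # Hypothetical stand-in for pystone.pystones: a small CPU-bound loop.
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor() as pool:
    # Submit many short CPU-bound tasks, mirroring the benchmark above.
    futures = [pool.submit(workload, 100_000) for _ in range(50)]
    results = [f.result() for f in futures]
elapsed = time.perf_counter() - start
print(f"{len(results)} tasks in {elapsed:.3f}s")
```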
So I think the change is defensible on those grounds. (And the non-BOLT libpython isn't slower, and is actually apparently faster, which is maybe concerning....)

Details of thread-local storage code generation differences:

From 1ac233:

```
 1ac233:	e8 98 be ff ff       	call   1a80d0 <PyUnicode_DecodeUTF8@plt>
			1ac234: R_X86_64_PLT32	PyUnicode_DecodeUTF8-0x4
 1ac238:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
 1ac23d:	48 85 c0             	test   %rax,%rax
 1ac240:	0f 84 79 01 00 00    	je     1ac3bf <_PyPegen_new_identifier+0x1af>
 1ac246:	f6 40 20 40          	testb  $0x40,0x20(%rax)
 1ac24a:	74 4c                	je     1ac298 <_PyPegen_new_identifier+0x88>
 1ac24c:	48 8d 3d 6d 71 4c 01 	lea    0x14c716d(%rip),%rdi        # 16733c0 <.got>
			1ac24f: R_X86_64_TLSLD	_Py_tss_tstate-0x4
 1ac253:	e8 08 eb ff ff       	call   1aad60 <__tls_get_addr@plt>
			1ac254: R_X86_64_PLT32	__tls_get_addr@GLIBC_2.3-0x4
 1ac258:	48 8b 80 18 00 00 00 	mov    0x18(%rax),%rax
			1ac25b: R_X86_64_DTPOFF32	_Py_tss_tstate
 1ac25f:	48 8b 78 10          	mov    0x10(%rax),%rdi
 1ac263:	4c 8d 74 24 08       	lea    0x8(%rsp),%r14
 1ac268:	4c 89 f6             	mov    %r14,%rsi
 1ac26b:	e8 d0 d2 ff ff       	call   1a9540 <_PyUnicode_InternImmortal@plt>
			1ac26c: R_X86_64_PLT32	_PyUnicode_InternImmortal-0x4
 1ac270:	48 8b 7b 20          	mov    0x20(%rbx),%rdi
 1ac274:	4d 8b 36             	mov    (%r14),%r14
 1ac277:	4c 89 f6             	mov    %r14,%rsi
 1ac27a:	e8 31 bb ff ff       	call   1a7db0 <_PyArena_AddPyObject@plt>
			1ac27b: R_X86_64_PLT32	_PyArena_AddPyObject-0x4
```

From 1c7bdb3:

```
 1c7bdb3:	e8 9e 39 dc ff       	call   1a3f756 <PyUnicode_DecodeUTF8>
 1c7bdb8:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
 1c7bdbd:	48 85 c0             	test   %rax,%rax
 1c7bdc0:	0f 84 d6 d0 2c 00    	je     1f48e9c <_PyPegen_new_identifier.cold+0x108>
 1c7bdc6:	f6 40 20 40          	testb  $0x40,0x20(%rax)
 1c7bdca:	0f 84 c4 cf 2c 00    	je     1f48d94 <_PyPegen_new_identifier.cold>
 1c7bdd0:	64 48 8b 04 25 f8 ff 	mov    %fs:0xfffffffffffffff8,%rax
 1c7bdd7:	ff ff
 1c7bdd9:	48 8b 78 10          	mov    0x10(%rax),%rdi
 1c7bddd:	4c 8d 74 24 08       	lea    0x8(%rsp),%r14
 1c7bde2:	4c 89 f6             	mov    %r14,%rsi
 1c7bde5:	e8 e0 32 dc ff       	call   1a3f0ca <_PyUnicode_InternImmortal>
 1c7bdea:	48 8b 7b 20          	mov    0x20(%rbx),%rdi
 1c7bdee:	4d 8b 36             	mov    (%r14),%r14
 1c7bdf1:	4c 89 f6             	mov    %r14,%rsi
 1c7bdf4:	e8 e3 1a dc ff       	call   1a3d8dc <_PyArena_AddPyObject>
```

Note the function call to `__tls_get_addr` in the former, versus the direct `%fs`-relative load in the latter.
Cherry-pick python/cpython#133313 and apply it. The general motivation is performance; see the patch commit message for more details.
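One quick way to sanity-check the linking change from inside the interpreter itself is to look at the process's memory maps. A Linux-only sketch (not a check used in the PR):

```python
def links_libpython_dynamically():
    """Report whether a libpython shared object is mapped into this
    process (Linux-only: reads /proc/self/maps)."""
    with open("/proc/self/maps") as f:
        return any("libpython" in line for line in f)
```

With libpython statically linked, running this in the standalone interpreter should return `False`, while an embedding application that loads `libpython3.x.so` would still return `True`.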