
Statically link libpython into interpreter (but keep building libpython3.x.so) #592


Open: wants to merge 1 commit into main from interp-static-libpython

Conversation

geofft commented Apr 19, 2025

Cherry-pick python/cpython#133313 and apply it. The general motivation is performance; see the patch commit message for more details.

geofft added the platform:linux label on Apr 19, 2025
geofft force-pushed the interp-static-libpython branch 3 times, most recently from a5b9b77 to d8838a7 on April 28, 2025

geofft commented May 2, 2025

No obvious performance impact in any direction, really.

In the following:

  • directory b contains cpython-3.13.3-x86_64-unknown-linux-gnu-pgo+lto from the previous workflow run, where I had incorrectly applied the patch after autoconf so it did nothing (which makes it a perfect baseline for comparison); I believe it was this workflow run specifically
  • directory c contains cpython-3.13.3-x86_64-unknown-linux-gnu-pgo+lto from the current run
  • and then I ran cp -a c new-build-with-old-binary; cp {b,new-build-with-old-binary}/python/install/bin/python3.13 to get a Python executable that uses libpython.so; i.e., in the new-build-with-old-binary directory, we are loading an unBOLTed libpython into the interpreter, just to see what libpython's performance is like (see the sketch after this list).
  • pystone.py is the last version from before it was removed from cpython sources (git show 61fd70e05027150b21184c7bc9fa8aa0a49f9601^:Lib/test/pystone.py)
  • test machine is an AWS t2.medium running the Ubuntu 24.04 AMI, recently rebooted / not running anything else
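
For reproducibility, the hybrid directory can also be built with a small Python sketch that mirrors those cp commands exactly (paths assumed to match the layout above):

import shutil

# Copy the new build wholesale, then overwrite its interpreter with the
# old dynamically-linked binary, so that binary loads the new build's
# (unBOLTed) libpython3.13.so via its relative rpath.
shutil.copytree("c", "new-build-with-old-binary", symlinks=True)
shutil.copy2("b/python/install/bin/python3.13",
             "new-build-with-old-binary/python/install/bin/python3.13")
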
ubuntu@ip-172-16-0-59:~$ hyperfine -L dir b,c,new-build-with-old-binary '{dir}/python/install/bin/python3 -c "import ssl"'
Benchmark 1: b/python/install/bin/python3 -c "import ssl"
  Time (mean ± σ):      35.2 ms ±   1.3 ms    [User: 27.7 ms, System: 7.3 ms]
  Range (min … max):    33.1 ms …  41.9 ms    70 runs
 
Benchmark 2: c/python/install/bin/python3 -c "import ssl"
  Time (mean ± σ):      35.2 ms ±   0.9 ms    [User: 27.1 ms, System: 8.0 ms]
  Range (min … max):    33.2 ms …  37.8 ms    82 runs
 
Benchmark 3: new-build-with-old-binary/python/install/bin/python3 -c "import ssl"
  Time (mean ± σ):      35.5 ms ±   1.0 ms    [User: 28.4 ms, System: 6.9 ms]
  Range (min … max):    33.6 ms …  38.7 ms    84 runs
 
Summary
  b/python/install/bin/python3 -c "import ssl" ran
    1.00 ± 0.04 times faster than c/python/install/bin/python3 -c "import ssl"
    1.01 ± 0.05 times faster than new-build-with-old-binary/python/install/bin/python3 -c "import ssl"
ubuntu@ip-172-16-0-59:~$ hyperfine -L dir b,c,new-build-with-old-binary '{dir}/python/install/bin/python3 -c "import ssl"'
Benchmark 1: b/python/install/bin/python3 -c "import ssl"
  Time (mean ± σ):      35.6 ms ±   0.8 ms    [User: 28.4 ms, System: 7.0 ms]
  Range (min … max):    33.8 ms …  37.6 ms    81 runs
 
Benchmark 2: c/python/install/bin/python3 -c "import ssl"
  Time (mean ± σ):      35.5 ms ±   0.9 ms    [User: 27.5 ms, System: 7.8 ms]
  Range (min … max):    33.5 ms …  37.8 ms    82 runs
 
Benchmark 3: new-build-with-old-binary/python/install/bin/python3 -c "import ssl"
  Time (mean ± σ):      36.1 ms ±   0.9 ms    [User: 28.5 ms, System: 7.3 ms]
  Range (min … max):    34.1 ms …  38.3 ms    81 runs
 
Summary
  c/python/install/bin/python3 -c "import ssl" ran
    1.00 ± 0.03 times faster than b/python/install/bin/python3 -c "import ssl"
    1.02 ± 0.04 times faster than new-build-with-old-binary/python/install/bin/python3 -c "import ssl"
ubuntu@ip-172-16-0-59:~$ hyperfine -L dir b,c,new-build-with-old-binary '{dir}/python/install/bin/python3 pystone.py'
Benchmark 1: b/python/install/bin/python3 pystone.py
  Time (mean ± σ):     120.3 ms ±   2.3 ms    [User: 114.0 ms, System: 6.0 ms]
  Range (min … max):   116.4 ms … 125.9 ms    25 runs
 
Benchmark 2: c/python/install/bin/python3 pystone.py
  Time (mean ± σ):     121.2 ms ±   1.3 ms    [User: 113.8 ms, System: 7.1 ms]
  Range (min … max):   118.7 ms … 124.5 ms    24 runs
 
Benchmark 3: new-build-with-old-binary/python/install/bin/python3 pystone.py
  Time (mean ± σ):     118.6 ms ±   1.3 ms    [User: 112.8 ms, System: 5.6 ms]
  Range (min … max):   116.5 ms … 120.9 ms    24 runs
 
Summary
  new-build-with-old-binary/python/install/bin/python3 pystone.py ran
    1.01 ± 0.02 times faster than b/python/install/bin/python3 pystone.py
    1.02 ± 0.02 times faster than c/python/install/bin/python3 pystone.py


geofft commented May 2, 2025

... which is a little surprising, come to think of it, given that the primary motivation for this was performance.

Just to confirm that the build change is actually doing what we expect:

ubuntu@ip-172-16-0-59:~$ ldd b/python/install/bin/python3
	linux-vdso.so.1 (0x00007ffcd91ba000)
	/home/ubuntu/b/python/install/bin/../lib/libpython3.13.so.1.0 (0x00007a40ed400000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007a40eebd9000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007a40eebd4000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007a40eebcf000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007a40ed317000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007a40eebc8000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007a40ed000000)
	/lib64/ld-linux-x86-64.so.2 (0x00007a40eebe6000)
ubuntu@ip-172-16-0-59:~$ ldd c/python/install/bin/python3
	linux-vdso.so.1 (0x00007ffc61bc9000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ac98f79b000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ac98f796000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007ac98f791000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ac98f6a8000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007ac98f6a3000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ac98f400000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ac98f7a8000)
ubuntu@ip-172-16-0-59:~$ ldd new-build-with-old-binary/python/install/bin/python3
	linux-vdso.so.1 (0x00007ffc041ea000)
	/home/ubuntu/new-build-with-old-binary/python/install/bin/../lib/libpython3.13.so.1.0 (0x000079fe10e00000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x000079fe12767000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000079fe12762000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x000079fe1275d000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000079fe12674000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x000079fe1266d000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000079fe10a00000)
	/lib64/ld-linux-x86-64.so.2 (0x000079fe12774000)

I wonder if the problem is that my AWS VM's filesystem is too fast! If I drop caches before each benchmark run, we see a more defensible couple-percent speedup from this change:

ubuntu@ip-172-16-0-59:~$ hyperfine -L dir b,c,new-build-with-old-binary '{dir}/python/install/bin/python3 pystone.py' --prepare 'echo 3 | sudo tee /proc/sys/vm/drop_caches'
Benchmark 1: b/python/install/bin/python3 pystone.py
  Time (mean ± σ):     285.4 ms ±   4.7 ms    [User: 117.8 ms, System: 15.7 ms]
  Range (min … max):   277.8 ms … 293.3 ms    10 runs
 
Benchmark 2: c/python/install/bin/python3 pystone.py
  Time (mean ± σ):     264.5 ms ±   5.2 ms    [User: 115.0 ms, System: 17.1 ms]
  Range (min … max):   256.3 ms … 269.2 ms    10 runs
 
Benchmark 3: new-build-with-old-binary/python/install/bin/python3 pystone.py
  Time (mean ± σ):     280.9 ms ±   7.3 ms    [User: 115.4 ms, System: 15.6 ms]
  Range (min … max):   269.4 ms … 289.6 ms    10 runs
 
Summary
  c/python/install/bin/python3 pystone.py ran
    1.06 ± 0.03 times faster than new-build-with-old-binary/python/install/bin/python3 pystone.py
    1.08 ± 0.03 times faster than b/python/install/bin/python3 pystone.py


geofft commented May 2, 2025

We'd also expect to see some speedup from faster thread-local storage access. I can confirm that we are getting the faster TLS model, but I can't actually measure a difference.

I've tried doing more threading with this silly benchmark, but only got a 1-2% performance difference.

import concurrent.futures

import pystone  # the pystone.py described above, on sys.path

# Submit 500 short pystone runs (1000 loops each) to a thread pool,
# then wait for them all to finish.
t = concurrent.futures.ThreadPoolExecutor()
for i in range(500):
    t.submit(pystone.pystones, 1000)
t.shutdown()
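
(For reference, this is the script saved as ~/threadstone.py and invoked in the second hyperfine run below.)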

However, on the free-threaded variants (cpython-3.13-x86_64-unknown-linux-gnu-freethreaded+pgo+lto), there is a noticeable performance difference:

ubuntu@ip-172-16-0-59:~/ft$ hyperfine -L dir b,c,new-build-with-old-binary '{dir}/python/install/bin/python3 ~/pystone.py'
Benchmark 1: b/python/install/bin/python3 ~/pystone.py
  Time (mean ± σ):     252.2 ms ±   2.5 ms    [User: 243.9 ms, System: 7.9 ms]
  Range (min … max):   248.9 ms … 255.8 ms    12 runs
 
Benchmark 2: c/python/install/bin/python3 ~/pystone.py
  Time (mean ± σ):     227.8 ms ±   1.7 ms    [User: 218.9 ms, System: 8.6 ms]
  Range (min … max):   224.0 ms … 230.0 ms    13 runs
 
Benchmark 3: new-build-with-old-binary/python/install/bin/python3 ~/pystone.py
  Time (mean ± σ):     241.7 ms ±   3.0 ms    [User: 233.0 ms, System: 8.4 ms]
  Range (min … max):   237.1 ms … 248.4 ms    12 runs
 
Summary
  c/python/install/bin/python3 ~/pystone.py ran
    1.06 ± 0.02 times faster than new-build-with-old-binary/python/install/bin/python3 ~/pystone.py
    1.11 ± 0.01 times faster than b/python/install/bin/python3 ~/pystone.py
ubuntu@ip-172-16-0-59:~/ft$ hyperfine -L dir b,c,new-build-with-old-binary '{dir}/python/install/bin/python3 ~/threadstone.py'
Benchmark 1: b/python/install/bin/python3 ~/threadstone.py
  Time (mean ± σ):      2.194 s ±  0.055 s    [User: 3.725 s, System: 0.548 s]
  Range (min … max):    2.111 s …  2.269 s    10 runs
 
Benchmark 2: c/python/install/bin/python3 ~/threadstone.py
  Time (mean ± σ):      1.992 s ±  0.037 s    [User: 3.396 s, System: 0.480 s]
  Range (min … max):    1.924 s …  2.037 s    10 runs
 
Benchmark 3: new-build-with-old-binary/python/install/bin/python3 ~/threadstone.py
  Time (mean ± σ):      2.109 s ±  0.037 s    [User: 3.571 s, System: 0.535 s]
  Range (min … max):    2.051 s …  2.167 s    10 runs
 
Summary
  c/python/install/bin/python3 ~/threadstone.py ran
    1.06 ± 0.03 times faster than new-build-with-old-binary/python/install/bin/python3 ~/threadstone.py
    1.10 ± 0.03 times faster than b/python/install/bin/python3 ~/threadstone.py

So I think the change is defensible on those grounds.

(And the non-BOLT libpython isn't slower, and is apparently actually faster, which is maybe concerning...)

Details of thread-local storage code generation differences:

From objdump -dr python/install/lib/libpython3.13.so.1.0, somewhere in the function _PyPegen_new_identifier:

  1ac233:	e8 98 be ff ff       	call   1a80d0 <PyUnicode_DecodeUTF8@plt>
			1ac234: R_X86_64_PLT32	PyUnicode_DecodeUTF8-0x4
  1ac238:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
  1ac23d:	48 85 c0             	test   %rax,%rax
  1ac240:	0f 84 79 01 00 00    	je     1ac3bf <_PyPegen_new_identifier+0x1af>
  1ac246:	f6 40 20 40          	testb  $0x40,0x20(%rax)
  1ac24a:	74 4c                	je     1ac298 <_PyPegen_new_identifier+0x88>
  1ac24c:	48 8d 3d 6d 71 4c 01 	lea    0x14c716d(%rip),%rdi        # 16733c0 <.got>
			1ac24f: R_X86_64_TLSLD	_Py_tss_tstate-0x4
  1ac253:	e8 08 eb ff ff       	call   1aad60 <__tls_get_addr@plt>
			1ac254: R_X86_64_PLT32	__tls_get_addr@GLIBC_2.3-0x4
  1ac258:	48 8b 80 18 00 00 00 	mov    0x18(%rax),%rax
			1ac25b: R_X86_64_DTPOFF32	_Py_tss_tstate
  1ac25f:	48 8b 78 10          	mov    0x10(%rax),%rdi
  1ac263:	4c 8d 74 24 08       	lea    0x8(%rsp),%r14
  1ac268:	4c 89 f6             	mov    %r14,%rsi
  1ac26b:	e8 d0 d2 ff ff       	call   1a9540 <_PyUnicode_InternImmortal@plt>
			1ac26c: R_X86_64_PLT32	_PyUnicode_InternImmortal-0x4
  1ac270:	48 8b 7b 20          	mov    0x20(%rbx),%rdi
  1ac274:	4d 8b 36             	mov    (%r14),%r14
  1ac277:	4c 89 f6             	mov    %r14,%rsi
  1ac27a:	e8 31 bb ff ff       	call   1a7db0 <_PyArena_AddPyObject@plt>
			1ac27b: R_X86_64_PLT32	_PyArena_AddPyObject-0x4

From objdump -dr python/install/bin/python3, same code:

 1c7bdb3:	e8 9e 39 dc ff       	call   1a3f756 <PyUnicode_DecodeUTF8>
 1c7bdb8:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
 1c7bdbd:	48 85 c0             	test   %rax,%rax
 1c7bdc0:	0f 84 d6 d0 2c 00    	je     1f48e9c <_PyPegen_new_identifier.cold+0x108>
 1c7bdc6:	f6 40 20 40          	testb  $0x40,0x20(%rax)
 1c7bdca:	0f 84 c4 cf 2c 00    	je     1f48d94 <_PyPegen_new_identifier.cold>
 1c7bdd0:	64 48 8b 04 25 f8 ff 	mov    %fs:0xfffffffffffffff8,%rax
 1c7bdd7:	ff ff 
 1c7bdd9:	48 8b 78 10          	mov    0x10(%rax),%rdi
 1c7bddd:	4c 8d 74 24 08       	lea    0x8(%rsp),%r14
 1c7bde2:	4c 89 f6             	mov    %r14,%rsi
 1c7bde5:	e8 e0 32 dc ff       	call   1a3f0ca <_PyUnicode_InternImmortal>
 1c7bdea:	48 8b 7b 20          	mov    0x20(%rbx),%rdi
 1c7bdee:	4d 8b 36             	mov    (%r14),%r14
 1c7bdf1:	4c 89 f6             	mov    %r14,%rsi
 1c7bdf4:	e8 e3 1a dc ff       	call   1a3d8dc <_PyArena_AddPyObject>

Note the call to __tls_get_addr in the shared library (the R_X86_64_TLSLD relocation, i.e., the local-dynamic TLS model) versus the direct %fs-relative load 8 bytes below the thread pointer in the binary (the local-exec model).
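
As a quick mechanical check of which TLS access style an artifact got, here's a minimal sketch (assuming binutils objdump is on PATH; the %fs: count is only a rough signal, since stack-protector code also reads %fs):

import subprocess

def tls_style(path):
    # Disassemble and count the two patterns visible above:
    # __tls_get_addr calls (general-/local-dynamic TLS) versus
    # direct %fs-relative accesses (initial-/local-exec TLS).
    out = subprocess.run(["objdump", "-d", path],
                         capture_output=True, text=True, check=True).stdout
    return {"__tls_get_addr": out.count("__tls_get_addr"),
            "%fs:": out.count("%fs:")}

print(tls_style("python/install/lib/libpython3.13.so.1.0"))
print(tls_style("python/install/bin/python3"))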
