Statically link libpython into interpreter (but keep building libpython3.x.so) #592
base: main
Conversation
(Force-pushed from a5b9b77 to d8838a7, then from d8838a7 to 2af08c0.)
No obvious performance impact in any direction, really. In the following:
... which is a little surprising, come to think of it, given that the primary motivation for this was performance. Just to confirm, the build change is actually doing what we expect:
I wonder if the problem is that my AWS VM's filesystem is too fast! If I drop caches before each benchmark run, then we see a more defensible couple-percent speedup from this change:
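For anyone reproducing this, dropping the Linux page cache between runs can be sketched as follows (Linux-only, requires root; the exact commands used for the numbers above aren't shown in the thread):

```python
import os

def drop_caches():
    """Drop the Linux page cache, dentries, and inodes (needs root).

    Equivalent to: sync && echo 3 > /proc/sys/vm/drop_caches
    """
    os.sync()  # flush dirty pages first so clean pages can be reclaimed
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")  # 1 = pagecache, 2 = dentries/inodes, 3 = both
```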
We'd also expect to see some speedup on thread-local storage. I can confirm that we are getting the faster TLS model, but I can't actually measure a difference. I've tried doing more threading with this silly benchmark, but only got a 1-2% performance difference:

```python
import pystone
import concurrent.futures

t = concurrent.futures.ThreadPoolExecutor()
for i in range(500):
    t.submit(pystone.pystones, 1000)
t.shutdown()
```

However, on the free-threaded variants (
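To time a run of that shape end to end, the pool can be wrapped in a simple timer; here is a sketch using a hypothetical stand-in workload in place of `pystone.pystones` (pystone isn't importable on stock Python 3):

```python
import concurrent.futures
import time

def workload(n):
    # Hypothetical stand-in for pystone.pystones: a small CPU-bound loop.
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor() as pool:
    # Submit many short CPU-bound tasks, mirroring the benchmark above.
    futures = [pool.submit(workload, 100_000) for _ in range(50)]
    results = [f.result() for f in futures]
elapsed = time.perf_counter() - start
print(f"{len(results)} tasks in {elapsed:.3f}s")
```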
So I think the change is defensible on those grounds. (And the non-BOLT libpython isn't slower, and is actually apparently faster, which is maybe concerning....)

Details of thread-local storage code generation differences:

From 1ac233:

```
 1ac233:	e8 98 be ff ff       	call   1a80d0 <PyUnicode_DecodeUTF8@plt>
			1ac234: R_X86_64_PLT32	PyUnicode_DecodeUTF8-0x4
 1ac238:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
 1ac23d:	48 85 c0             	test   %rax,%rax
 1ac240:	0f 84 79 01 00 00    	je     1ac3bf <_PyPegen_new_identifier+0x1af>
 1ac246:	f6 40 20 40          	testb  $0x40,0x20(%rax)
 1ac24a:	74 4c                	je     1ac298 <_PyPegen_new_identifier+0x88>
 1ac24c:	48 8d 3d 6d 71 4c 01 	lea    0x14c716d(%rip),%rdi        # 16733c0 <.got>
			1ac24f: R_X86_64_TLSLD	_Py_tss_tstate-0x4
 1ac253:	e8 08 eb ff ff       	call   1aad60 <__tls_get_addr@plt>
			1ac254: R_X86_64_PLT32	__tls_get_addr@GLIBC_2.3-0x4
 1ac258:	48 8b 80 18 00 00 00 	mov    0x18(%rax),%rax
			1ac25b: R_X86_64_DTPOFF32	_Py_tss_tstate
 1ac25f:	48 8b 78 10          	mov    0x10(%rax),%rdi
 1ac263:	4c 8d 74 24 08       	lea    0x8(%rsp),%r14
 1ac268:	4c 89 f6             	mov    %r14,%rsi
 1ac26b:	e8 d0 d2 ff ff       	call   1a9540 <_PyUnicode_InternImmortal@plt>
			1ac26c: R_X86_64_PLT32	_PyUnicode_InternImmortal-0x4
 1ac270:	48 8b 7b 20          	mov    0x20(%rbx),%rdi
 1ac274:	4d 8b 36             	mov    (%r14),%r14
 1ac277:	4c 89 f6             	mov    %r14,%rsi
 1ac27a:	e8 31 bb ff ff       	call   1a7db0 <_PyArena_AddPyObject@plt>
			1ac27b: R_X86_64_PLT32	_PyArena_AddPyObject-0x4
```

From 1c7bdb3:

```
 1c7bdb3:	e8 9e 39 dc ff       	call   1a3f756 <PyUnicode_DecodeUTF8>
 1c7bdb8:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
 1c7bdbd:	48 85 c0             	test   %rax,%rax
 1c7bdc0:	0f 84 d6 d0 2c 00    	je     1f48e9c <_PyPegen_new_identifier.cold+0x108>
 1c7bdc6:	f6 40 20 40          	testb  $0x40,0x20(%rax)
 1c7bdca:	0f 84 c4 cf 2c 00    	je     1f48d94 <_PyPegen_new_identifier.cold>
 1c7bdd0:	64 48 8b 04 25 f8 ff 	mov    %fs:0xfffffffffffffff8,%rax
 1c7bdd7:	ff ff
 1c7bdd9:	48 8b 78 10          	mov    0x10(%rax),%rdi
 1c7bddd:	4c 8d 74 24 08       	lea    0x8(%rsp),%r14
 1c7bde2:	4c 89 f6             	mov    %r14,%rsi
 1c7bde5:	e8 e0 32 dc ff       	call   1a3f0ca <_PyUnicode_InternImmortal>
 1c7bdea:	48 8b 7b 20          	mov    0x20(%rbx),%rdi
 1c7bdee:	4d 8b 36             	mov    (%r14),%r14
 1c7bdf1:	4c 89 f6             	mov    %r14,%rsi
 1c7bdf4:	e8 e3 1a dc ff       	call   1a3d8dc <_PyArena_AddPyObject>
```

Note the function call to `__tls_get_addr` in the former, versus the direct `%fs`-relative load in the latter.
Cherry-pick python/cpython#133313 and apply it. The general motivation is performance; see the patch commit message for more details.
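One quick way to sanity-check the linking change from inside the interpreter itself is to look at the process's memory maps. A Linux-only sketch (not a check used in the PR):

```python
def links_libpython_dynamically():
    """Report whether a libpython shared object is mapped into this
    process (Linux-only: reads /proc/self/maps)."""
    with open("/proc/self/maps") as f:
        return any("libpython" in line for line in f)
```

With libpython statically linked, running this in the standalone interpreter should return `False`, while an embedding application that loads `libpython3.x.so` would still return `True`.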