Instance method performance issue with free threading (3.13rc1) #123619
The problem here is reference count contention: each thread has to acquire the per-object lock to modify the reference count. Unfortunately, we can't enable deferred reference counting for that many objects, because it extends the lifetime of objects, and many programs rely on objects being freed promptly. I've spun up a toy build with an API to enable deferred reference counting from Python code.
And here it is with it enabled for
I don't think we should expose a Python-level API for deferred RC. It makes other implementations worse off and exposes yet another implementation detail to the user. Exposing an unstable C-level API still makes sense.
Yeah, which is why it would make sense to give that tradeoff to the user. Anyway, something like
I've created #123635 to add an API for this. I'm leaving it as a draft for now, though.
For developers unfamiliar with the C API (me), a Python-level API would be much more accessible, perhaps something like a class attribute
I don't think we have a way to denote unstable Python APIs, do we?
Oh! That's good to hear. Adding deferred RC to
In my opinion, we should avoid providing implementation details to the C API as much as possible. cc @vstinner |
That's why my PR puts it in |
Here are the results from 32 cores and 32 threads:
The cost of contention is extremely high. PEP 703 mentions, "A few types of objects, such as top-level functions, code objects, modules, and methods, tend to be frequently accessed by many threads concurrently." However, it seems that while functions have deferred reference counting, methods do not. Python users might not realize this when using free-threaded Python, and if they don't read the C API documentation, they may not know that methods can become a bottleneck at scale and that they need to enable it manually via the C API. PEP 703 also mentions that lists will have a fast path; however, based on my tests, lists are slow, while tuples perform very quickly. A benchmark on global list access:

```python
from threading import Thread
import time

gl = [1]

def getg(i):
    # read from a shared global list
    return gl[0]

def getl(i):
    # build and read a fresh local list
    return [1][0]

def bench_run(runner):
    s = time.monotonic_ns()
    for i in range(500000):
        runner(i)
    d = time.monotonic_ns() - s
    print(f"{d:,}")

def bench_run_parallel(count, runner):
    tl = []
    for i in range(count):
        t = Thread(target=bench_run, args=[runner])
        tl.append(t)
        t.start()
    for t in tl:
        t.join()

if __name__ == '__main__':
    print("no threading global list")
    bench_run(getg)
    print("\nthreading global list")
    bench_run_parallel(6, getg)
    print("\nno threading local list")
    bench_run(getl)
    print("\nthreading local list")
    bench_run_parallel(6, getl)
```
Use tuple as global list(
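The tuple variant isn't shown in full above; a sketch of the swap (the names `gt` and `getgt` are mine) under the same benchmark harness might look like:

```python
from threading import Thread
import time

gt = (1,)  # hypothetical name: same data as gl, stored in a tuple

def getgt(i):
    # same access pattern as getg(), but against a shared tuple
    return gt[0]

def bench_run(runner, n=500_000):
    s = time.monotonic_ns()
    for i in range(n):
        runner(i)
    print(f"{time.monotonic_ns() - s:,}")

# 6 threads reading the shared tuple, mirroring bench_run_parallel above
threads = [Thread(target=bench_run, args=[getgt]) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Per the measurements reported in this thread, this tuple version scaled well across threads while the list version did not.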
Ah, this might be a bug: per PEP 703, deferred reference counting is enabled for the actual function object, but apparently not for the bound-method objects created on attribute access. If we really want to expose a Python API, let's add something unspecific to deferred RC -- something like
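The function/method distinction is visible from Python: each attribute access builds a fresh bound-method object around the underlying function, so deferred RC on the function alone doesn't cover the wrappers. A minimal illustration:

```python
class C:
    def m(self):
        return 1

c = C()

# Each attribute access creates a new bound-method object wrapping C.m
b1 = c.m
b2 = c.m
print(b1 is b2)                    # False: two distinct bound-method objects
print(b1.__func__ is b2.__func__)  # True: both wrap the same function object
```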
@ZeroIntensity I might have found another bug. In my global list benchmark, if I use |
Both are slow on the main branch (main:d8c0fe1). I'm not sure if this is intentional, but if the contention case is expected, then it might not be a bug.
Python 3.14.0a0 experimental free-threading build (heads/main:d8c0fe1, Sep 18 2024, 09:23:28) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
I wouldn't expect any speedup on a build with the GIL enabled. Since there's no I/O here, the GIL is never released to wait for anything, and since only one thread can execute at a time, the result is just the extra overhead introduced by threading.
@ZeroIntensity Thanks for keeping an eye on my issue. From what I understand, the
Slowness on the main branch is intended: we disabled immortalization to introduce deferred refcounting, and deferred refcounting is not yet done, so expect slowness on main.
Sorry, I thought you meant that you did a fresh build of the main branch (with the GIL enabled). I wasn't aware of the DRC problem on main, but yeah, that would explain it.
…ence counting (GH-123635) Co-authored-by: Sam Gross <colesbury@gmail.com>
Now that we have an official API for DRC, I'm going to close this. Thanks, @Yiling-J!
Looks like I'll have to learn some C and CPython internals to make my cache library scalable. Honestly, I still think the best approach would be for CPython to optimize these by default; scalability is also important for free-threaded Python. It would be even better if there were official documentation telling developers what is scalable in free-threaded CPython. @ZeroIntensity
You shouldn't need to learn much C; you can call the function pretty easily with ctypes:

```python
import ctypes

# PyUnstable_Object_EnableDeferredRefcount, exposed via the C API
enable_drc = ctypes.pythonapi.PyUnstable_Object_EnableDeferredRefcount
enable_drc.argtypes = (ctypes.py_object,)
enable_drc.restype = ctypes.c_int

enable_drc(whatever_you_want)
```
@ZeroIntensity I ran another round of benchmark using
I'm not sure what the state of DRC is on main, it might still be disabled/suboptimal.
Note that you're not measuring throughput here, you're measuring time taken. |
That's true, but throughput can easily be derived from the time taken, and if the throughput aligns, the time should as well. My main point is that the issue still persists: the performance doesn't match my expectations.
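For reference, converting the elapsed time printed by the benchmark into throughput is a one-liner (the elapsed figure below is hypothetical):

```python
ops = 500_000                 # iterations per bench_run call
elapsed_ns = 250_000_000      # hypothetical: 0.25 s reported by the benchmark
ops_per_sec = ops / (elapsed_ns / 1_000_000_000)
print(f"{ops_per_sec:,.0f} ops/sec")  # 2,000,000 ops/sec
```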
Sort of, but there's not much actual computing happening here: the interpreter is just passing an object around and messing with its reference count, which will basically always be faster on the main thread because of biased reference counting. As a result, the bottleneck is the reference count operation itself, which isn't really a fair benchmark. This isn't a problem limited to Python; in basically any language with multithreading, you need some sort of atomic operation to use data concurrently, and if you only measure the cost of that against single-threaded code, then of course it will seem slower.
Bug report
Bug description:
Processor: 2.6 GHz 6-Core Intel Core i7
Python 3.13.0rc1 experimental free-threading build (main, Sep 3 2024, 09:59:18) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
When using 6 threads, the instance method becomes approximately 10 times slower, while a plain function shows comparable performance.
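The benchmark script itself isn't included in this excerpt; a minimal reproducer consistent with the description (class and function names are mine) might be:

```python
from threading import Thread
import time

class Worker:
    def get(self, i):
        return i

def get_func(i):
    return i

def bench_run(runner, n=500_000):
    s = time.monotonic_ns()
    for i in range(n):
        runner(i)
    print(f"{time.monotonic_ns() - s:,}")

def bench_run_parallel(count, runner):
    threads = [Thread(target=bench_run, args=[runner]) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    w = Worker()
    print("plain function, 6 threads")
    bench_run_parallel(6, get_func)
    print("\nbound method, 6 threads")
    # All threads share one bound-method object, so its reference
    # count becomes a point of contention on the free-threaded build.
    bench_run_parallel(6, w.get)
```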
Python 3.12.5 (main, Sep 2 2024, 15:11:13) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
CPython versions tested on:
3.13
Operating systems tested on:
macOS
Linked PRs