GC performance regression in free threaded build #132917
Comments
Yes, this is expected and not a bug.
Closing this issue since this is expected.
To add a bit more context: if you have an inherently serial operation, like incrementing a counter or appending to a list, it's going to be slower when run with multiple threads on multiple CPUs than if you have some sort of coarse-grained serialization (like the GIL) or run it on a single CPU.
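The point above can be seen directly: the sketch below (an assumption for illustration, not code from the issue) splits the same number of appends to one shared list across several threads. Each `append` contends for the same list, so the total work is unchanged and coordination overhead is added.

```python
import threading

def append_n(shared, n):
    for i in range(n):
        shared.append(i)  # every thread contends for the same list

def run(num_threads, total=100_000):
    """Do `total` appends to one shared list, split across `num_threads` threads."""
    shared = []
    per_thread = total // num_threads
    threads = [threading.Thread(target=append_n, args=(shared, per_thread))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(shared)

print(run(1))   # single-threaded baseline → 100000
print(run(4))   # same total work across 4 threads → 100000, but no faster
```

Both calls produce the same result; splitting a serial operation across threads only adds communication cost.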
@colesbury I have been reading the GIL removal PEP, and this issue has been closed as "not planned". Do we have any plans to at least partially fix the performance, especially in cases like […]?
No
Is this issue specific to .append, or does it apply to .extend and slice insertions as well? Is (or should) this be documented and warned against somewhere other than here?
I don't think we need to specifically warn against it any more than we would warn against building a list via successive […]. This is an inherently serial operation. If you try to perform a serial operation from multiple CPUs, you introduce lots of extra communication and everything gets slower. That's normal and not really specific to Python, just like the quadratic behavior of some data structures is normal and not specific to Python.
@colesbury Thank you for the response. I understand the general principle about serial operations being slower with multiple threads. However, I want to clarify that my MRE doesn't use any threading at all - it's entirely single-threaded code. The performance regression I'm observing is occurring in single-threaded code simply by running it in free-threaded Python mode. This suggests that there's a significant overhead for thread safety mechanisms even when no actual threading is used. This seems important to document, since while experienced concurrent programmers will understand the underlying reasons, many Python users might not expect their unthreaded code to suddenly run 10-15x slower. Would it be worth mentioning this behavior in the free-threading documentation - that even single-threaded code using mutable shared data structures may experience performance regressions due to the thread safety overhead? Also, is there any scope for optimizing free-threaded mode under these conditions in future releases?
Sorry, I didn't pay close enough attention to the repro and focused on "shared list appends". Thanks for following up. This appears to be some sort of performance regression related to GC. The […] You can verify this by adding a […]
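The specific calls in the comment above were lost in extraction; assuming the suggestion was along the lines of disabling the cyclic GC around the benchmark to confirm GC involvement, a minimal sketch:

```python
import gc
import time

def allocate(n=200_000):
    """Allocate many container objects; dicts are tracked by the cyclic GC."""
    data = []
    for i in range(n):
        data.append({"i": i})
    return data

def timed():
    t0 = time.perf_counter()
    allocate()
    return time.perf_counter() - t0

with_gc = timed()      # automatic collections may run during allocation
gc.disable()
try:
    without_gc = timed()  # if GC is the culprit, this should be much faster
finally:
    gc.enable()

print(f"gc on:  {with_gc:.3f}s")
print(f"gc off: {without_gc:.3f}s")
```

If the slowdown mostly disappears with GC disabled, that points at collection heuristics rather than per-object locking.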
cc @nascheme, in case you want to take a look at this |
Yeah, the difference in performance is almost purely due to the different behavior of the cyclic GC. With the […]
For the free-threaded build, check the increase in the process resident set size (RSS) before triggering a full automatic garbage collection. If the RSS has not increased by 10% since the last collection, then the collection is deferred.
Fix data race detected by tsan (https://github.com/python/cpython/actions/runs/14857021107/job/41712717208?pr=133502): young.count can be modified by other threads even while the gcstate is locked. This is the simplest fix to (potentially) unblock beta 1, although this particular code path seems like it could just be an atomic swap followed by an atomic add, without having the lock at all.
On Linux, use /proc/self/status for memory usage info. Using smaps_rollup is quite a lot slower, and we can get similar info from /proc/self/status.
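A sketch of the cheap RSS read the commit message describes (Linux-only; the function name is an assumption, not CPython's internal API):

```python
def rss_kib():
    """Return resident set size in KiB from /proc/self/status, or None."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])  # field is reported in kB
    except OSError:
        return None  # not on Linux, or /proc unavailable
    return None

print(rss_kib())
```

Reading /proc/self/status is a single small sequential read, whereas smaps_rollup forces the kernel to walk and aggregate every memory mapping.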
…thonGH-134692) (cherry picked from commit ac539e7) Co-authored-by: Kumar Aditya <kumaraditya@python.org>
This is fixed in main now with:
@nascheme - should the fixes be backported to 3.14?
Oh, seems to be fixed in 3.14 as well. |
Bug report
Bug description:
I've identified a significant performance regression when using Python's free-threaded mode with shared list appends. In my test case, simply appending to a shared list causes a 10-15x performance decrease compared to normal Python operation.
Test Case:
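(The original snippet did not survive extraction. A plausible minimal benchmark of the shape described — single-threaded list appends, no threading — offered as an assumption, not the author's actual code:)

```python
import time

def bench(n=1_000_000):
    """Time n appends to a single list; no threads involved."""
    shared = []
    t0 = time.perf_counter()
    for i in range(n):
        shared.append(i)
    return time.perf_counter() - t0

# Run once on a standard build and once on a free-threaded (python3.14t)
# build to compare; the report describes a 10-15x gap.
print(f"{bench():.3f}s")
```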
Results:
The regression appears to be caused by contention on the per-list locks and reference count fields when appending to a shared list in free-threaded mode.
CPython versions tested on:
3.14
Operating systems tested on:
Linux
Linked PRs
- `last_mem` in free-threading gc #134692
- `last_mem` in free-threading gc (GH-134692) #134802