GC performance regression in free threaded build #132917


Closed
gptsarthak opened this issue Apr 25, 2025 · 13 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) · performance (Performance or resource usage) · topic-free-threading · type-bug (An unexpected behavior, bug, or error)

Comments

@gptsarthak commented Apr 25, 2025

Bug report

Bug description:

I've identified a significant performance regression when using Python's free-threaded mode with shared list appends. In my test case, simply appending to a shared list causes a 10-15x performance decrease compared to normal Python operation.

Test Case:

import itertools
import time

def performance_test(n_options=5, n_items=5, iterations=50):
    list = []  # shared list in the enclosing scope (note: shadows the built-in name)
    
    def expensive_operation():
        # Create lists of tuples
        data = []
        for _ in range(n_options):
            data.append([(f"a{i}", f"b{i}") for i in range(n_items)])
        
        # Generate all combinations and create result tuples
        results = []
        for combo in itertools.product(*data):
            result = tuple((x[0], x[1], f"name_{i}") for i, x in enumerate(combo))
            results.append(result)
        
        # Commenting the following line solves the performance regression in free-threaded mode
        list.append(results)
        return results

    start = time.time()
    for _ in range(iterations):
        result = expensive_operation()
    
    duration = time.time() - start
    print(f"n_options={n_options}, n_items={n_items}, iterations={iterations}")
    print(f"Time: {duration:.4f}s, Combinations: {len(result)}")
    return duration

if __name__ == "__main__":
    print("Python Performance Regression Test")
    print("-" * 40)
    performance_test()

Results:

  • Standard Python 3.13: 0.1290s
  • Free-threaded Python 3.13t: 2.1643s
  • Free-threaded Python 3.14.0a7: 2.1923s
  • Free-threaded Python 3.13t with list.append commented out: 0.1332s

The regression appears to be caused by contention on the per-list locks and reference count fields when appending to a shared list in free-threaded mode.

CPython versions tested on:

3.14

Operating systems tested on:

Linux

Linked PRs

@gptsarthak gptsarthak added the type-bug An unexpected behavior, bug, or error label Apr 25, 2025
@picnixz picnixz added performance Performance or resource usage interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-free-threading labels Apr 25, 2025
@colesbury (Contributor)

Yes, this is expected and not a bug.

@picnixz picnixz added pending The issue will be closed if no feedback is provided and removed type-bug An unexpected behavior, bug, or error labels Apr 25, 2025
@corona10 (Member)

Closing this issue since this is expected.

@corona10 corona10 closed this as not planned (won't fix, can't repro, duplicate, stale) Apr 25, 2025
@colesbury (Contributor)

To add a bit more context: if you have an inherently serial operation, like incrementing a counter or appending to a list, it's going to be slower when run with multiple threads on multiple CPUs than if you have some sort of coarse-grained serialization (like the GIL) or run it on a single CPU.
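
As a purely illustrative sketch (this is not the reproducer from this issue, and the thread count and item counts are arbitrary), the difference between contending on one shared list and letting each thread build its own list looks roughly like this:

import threading

def append_shared(n_threads=4, n_items=100_000):
    shared = []

    def worker():
        for i in range(n_items):
            shared.append(i)  # every thread serializes on the same list

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(shared)

def append_local(n_threads=4, n_items=100_000):
    parts = [[] for _ in range(n_threads)]

    def worker(part):
        for i in range(n_items):
            part.append(i)  # each thread writes only to its own list

    threads = [threading.Thread(target=worker, args=(p,)) for p in parts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # one serial merge at the end instead of contention on every append
    return len([x for part in parts for x in part])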

@srinivasreddy (Contributor) commented Apr 25, 2025

@colesbury I have been reading the GIL-removal PEP, and this issue has been closed as "not planned". Do we have any plans to at least partially fix the performance, especially in cases like list.append(results)?

@colesbury (Contributor)

No

@terryjreedy (Member)

Is this issue specific to .append or does it apply to .extend and slice insertions?

Is this documented, or should it be documented and warned against, somewhere other than here?

@colesbury (Contributor)

I don't think we need to specifically warn against it any more than we would warn against building a list via successive list.insert(0, item).

This is an inherently serial operation. If you try to perform a serial operation from multiple CPUs, you introduce lots of extra communication and everything gets slower. That's normal and not really specific to Python, just like quadratic behavior of some data structures is normal and not specific to Python.
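
As a rough illustration of that analogy (sizes here are arbitrary), repeated insert(0, ...) shifts every existing element on each call, while append() is amortized constant time:

import timeit

print(timeit.timeit("lst.insert(0, None)", setup="lst = []", number=50_000))
print(timeit.timeit("lst.append(None)", setup="lst = []", number=50_000))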

@gptsarthak (Author)

@colesbury Thank you for the response. I understand the general principle about serial operations being slower with multiple threads. However, I want to clarify that my MRE doesn't use any threading at all - it's entirely single-threaded code.

The performance regression I'm observing is occurring in single-threaded code simply by running it in free-threaded Python mode. This suggests that there's a significant overhead for thread safety mechanisms even when no actual threading is used.

This seems important to document, since while experienced concurrent programmers will understand the underlying reasons, many Python users might not expect their unthreaded code to suddenly run 10-15x slower.

Would it be worth mentioning this behavior in the free-threading documentation - that even single-threaded code using mutable shared data structures may experience performance regressions due to the thread safety overhead?

Also, is there any scope for optimizing free-threaded mode under these conditions in future releases?

@colesbury (Contributor)

Sorry, I didn't pay close enough attention to the repro and focused on "shared list appends". Thanks for following up.

This appears to be some sort of performance regression related to GC. The list.append(results) isn't really slow itself; it just holds onto the generated combinations, which leads to the GC being scheduled. Without that list.append(), the objects are freed after each call to expensive_operation().

You can verify this by adding import gc; gc.disable() at the top.
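
For example, the top of the reproducer would become (the rest of the script is unchanged); if the free-threaded timings then drop back near the default-build numbers, the slowdown is coming from GC scheduling rather than from list.append() itself:

import gc
gc.disable()  # rule out GC scheduling as the cause of the slowdown

import itertools
import time

# ... performance_test() and the rest of the original script unchanged ...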

@colesbury colesbury reopened this Apr 26, 2025
@colesbury colesbury removed the pending The issue will be closed if no feedback is provided label Apr 26, 2025
@colesbury colesbury changed the title from "15x Performance regression with shared list appends in free-threaded mode" to "GC performance regression in free threaded build" Apr 26, 2025
@colesbury (Contributor)

cc @nascheme, in case you want to take a look at this

@picnixz picnixz added the type-bug An unexpected behavior, bug, or error label Apr 26, 2025
@nascheme (Member)

Yeah, the difference in performance is almost purely due to the different behavior of the cyclic GC. With the list.append() you are quickly increasing the number of live container objects, and the GC runs many times. The free-threaded build doesn't have a generational collector, so it needs to do a full collection each time the threshold for the increase in live objects is exceeded. I need to investigate further to confirm, but I suspect that since you are using tuples, the default-build GC is untracking them (since they don't contain containers), so the need to perform full GC collections is greatly reduced. Needs more investigation to confirm that's all that's happening and there is not some other issue.
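
If that suspicion is right, the untracking can be observed on the default build with gc.is_tracked() (illustrative snippet, not part of the original report):

import gc

t = tuple(str(i) for i in range(3))  # built at runtime, so tracked initially
print(gc.is_tracked(t))              # True
gc.collect()
print(gc.is_tracked(t))              # False on the default build: the tuple
                                     # contains no containers and gets untracked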

nascheme added a commit that referenced this issue May 5, 2025
For the free-threaded build, check the process resident set size (RSS)
increase before triggering a full automatic garbage collection.  If the RSS
has not increased 10% since the last collection then it is deferred.
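
A rough Python sketch of that heuristic (the names here are made up for illustration; the real change is in the C garbage collector):

def should_run_full_gc(current_rss: int, rss_at_last_gc: int) -> bool:
    # Defer the full collection unless the resident set size has grown by
    # at least 10% since the previous collection.
    return current_rss >= rss_at_last_gc * 1.10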
Yhg1s added a commit that referenced this issue May 6, 2025
Fix data race detected by tsan
(https://github.com/python/cpython/actions/runs/14857021107/job/41712717208?pr=133502):
young.count can be modified by other threads even while the gcstate is
locked.

This is the simplest fix to (potentially) unblock beta 1, although this
particular code path seems like it could just be an atomic swap followed by
an atomic add, without having the lock at all.
nascheme added a commit to nascheme/cpython that referenced this issue May 7, 2025
nascheme added a commit to nascheme/cpython that referenced this issue May 8, 2025
nascheme added a commit that referenced this issue May 8, 2025
On Linux, use /proc/self/status for mem usage info. Using smaps_rollup is quite a lot slower and we can get similar info from /proc/self/status.
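
For reference, reading the resident set size from /proc/self/status looks roughly like this in Python (the actual change is in C; the function name here is hypothetical):

def read_vmrss_kib() -> int:
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # VmRSS is reported in kB
    return 0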
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 8, 2025
…133544)

On Linux, use /proc/self/status for mem usage info. Using smaps_rollup is quite a lot slower and we can get similar info from /proc/self/status.
(cherry picked from commit 751db4e)

Co-authored-by: Neil Schemenauer <nas-github@arctrix.com>
nascheme added a commit that referenced this issue May 8, 2025
… (gh-133718)

On Linux, use /proc/self/status for mem usage info. Using smaps_rollup is quite a lot slower and we can get similar info from /proc/self/status.
(cherry picked from commit 751db4e)

Co-authored-by: Neil Schemenauer <nas-github@arctrix.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 27, 2025
…thonGH-134692)

(cherry picked from commit ac539e7)

Co-authored-by: Kumar Aditya <kumaraditya@python.org>
kumaraditya303 added a commit that referenced this issue May 27, 2025
…H-134692) (#134802)

gh-132917: fix data race on `last_mem` in free-threading gc  (GH-134692)
(cherry picked from commit ac539e7)

Co-authored-by: Kumar Aditya <kumaraditya@python.org>
@colesbury (Contributor)

Oh, seems to be fixed in 3.14 as well.
