
gh-134761: Use deferred reference counting for threading concurrency primitives #134762


Open
wants to merge 6 commits into main

Conversation

ZeroIntensity
Member

@ZeroIntensity ZeroIntensity commented May 26, 2025

This scales much better. Using this script:

import threading
import time

lock = threading.Lock()

def scale():
    a = time.perf_counter()
    for _ in range(10000000):
        lock.locked()
    b = time.perf_counter()
    print(b - a, "s")

threads = [threading.Thread(target=scale) for _ in range(8)]
for thread in threads:
    thread.start()

With this applied, I see performance similar to when lock is a local variable:

0.3701289139999062 s
0.40727080300075613 s
0.41241479399923264 s
0.4155945310003517 s
0.44201267799962807 s
0.4484649369996987 s
0.4601175060006426 s
0.46210344200062536 s

Prior to this change, I see:

3.425866439999936 s
3.5953266010001244 s
3.6094701500001065 s
3.667731437000157 s
4.458146230000011 s
4.466017671000145 s
4.499206339000011 s
4.50090869099995 s

📚 Documentation preview 📚: https://cpython-previews--134762.org.readthedocs.build/

@ZeroIntensity
Member Author

I'd appreciate feedback from anyone here. There's no precedent in the standard library for using deferred reference counting at the Python level, so we're in some uncharted waters here. I don't think anyone would have been relying on when threading objects are deallocated, so that shouldn't be an issue. Are there other tradeoffs to using DRC that I'm not aware of?

@@ -290,6 +290,25 @@ always available. Unless explicitly noted otherwise, all variables are read-only
This function is specific to CPython. The exact output format is not
defined here, and may change.

.. function:: _defer_refcount(op)
Member

I did not take a look overall, but I am not sure it is worth exposing this through the documentation, since it is not a public API.

Member Author

Yeah, I'm a little bit on the fence about it. We do this already for some other functions in sys: _is_gil_enabled, _is_interned, _is_immortal, and a few others. Basically, it's a way to expose implementation details that are useful at a Python level.

Overall, I think this is worth documenting. There's no other way to improve scaling from Python right now, and it's similar to using an unstable C API (in the sense that it could be removed in any version).
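To illustrate the kind of introspection helpers mentioned above: since these sys functions are private and version-specific, a defensive lookup avoids AttributeError on interpreters that lack them. This is just a sketch; the `probe` helper name is made up for illustration.

```python
# Sketch: probing CPython's private sys introspection helpers.
# These are implementation details and may be absent on other
# versions or implementations, so look them up with getattr.
import sys

def probe(name, *args):
    """Call a private sys helper if it exists; return None otherwise."""
    fn = getattr(sys, name, None)
    return fn(*args) if fn is not None else None

print(probe("_is_gil_enabled"))       # True/False on CPython 3.13+, else None
print(probe("_is_interned", "spam"))  # bool on CPython 3.13+, else None
```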

Contributor

What about documenting this somewhere in https://github.com/python/cpython/tree/main/InternalDocs? I fully agree with removing it from the normal documentation (for the reasons mentioned), but as a user/developer/tester I find the Python API for _defer_refcount (and others like _is_immortal) useful, and some documentation is always nice.

@corona10
Member

corona10 commented May 27, 2025

Personal opinion: I think we should not expose the free-threaded implementation at the Python level to end users. Fragmentation should be confined to the C API.

@corona10
Member

cc @vstinner @colesbury

@ZeroIntensity
Member Author

Personal opinion: I think we should not expose the free-threaded implementation at the Python level to end users. Fragmentation should be confined to the C API.

Reiterating my previous comment: I'm +0.1 for documenting it. I'll happily yield if others are strongly opposed.

@corona10
Member

corona10 commented May 27, 2025

I am -1 on documentation. We have a bunch of private Python APIs related to the C API, but we don't expose them in the Python documentation, since we don't want people to depend on them.

@Fidget-Spinner
Member

Personal opinion: I think we should not expose the free-threaded implementation at the Python level to end users. Fragmentation should be confined to the C API.

Same. I truly think we shouldn't be exposing anything that relies on our implementation of refcounting. It's causing a hard enough time for other implementations already. And even causes problems for ourselves when we try to optimize away refcounting.

@ZeroIntensity
Member Author

I truly think we shouldn't be exposing anything that relies on our implementation of refcounting. It's causing a hard enough time for other implementations already.

In general, I agree. I've removed the documentation note from this PR.

That said, I do think it's important that we do expose and document some (unstable) APIs for working with the current reference counting implementation, because otherwise people will try to hack it together themselves. For example, Nanobind was using silly hacks with immortalization before we exposed some better APIs for messing with reference counting. I think that's much more dangerous.

@@ -950,6 +950,7 @@ lock_new_impl(PyTypeObject *type)
if (self == NULL) {
return NULL;
}
_PyObject_SetDeferredRefcount((PyObject *)self);
Member

Can you move the call after initializing self->lock? Same remark for rlock_new_impl() below.

@@ -2653,6 +2654,23 @@ sys__is_gil_enabled_impl(PyObject *module)
#endif
}

/*[clinic input]
sys._defer_refcount -> bool
Member

Please document it in Doc/library/sys.rst. If it "should not be used", add a clear explanation why it should not be used there. If it's not documented, the lack of documentation doesn't prevent users from using it.

Member Author

See Ken and Donghee's comments.

Member

Well, I disagree with them. IMO we should document sys functions.

Member

I would only be supportive of documenting this if we were allowed to change it in a minor version with no deprecation period. My understanding is that PyUnstable in the C API allows that, but exposing it as sys._x means we are stuck with at least two deprecation cycles, with five recommended. Users should not rely on this function in the first place, except in very specific scenarios.

One way to "bypass" this is to make the function a no-op in future versions of Python once we solve this issue altogether. But I don't know what users will rely on by then, so I'm a bit worried.

Member Author

Hmm, I thought we were allowed to change sys._x things in minor versions without deprecation. If not, that's a problem.

Member Author

I think we're getting a bit hung up on this point. We can add or remove the documentation for _defer_refcount later, it's not too important. Does everything else look fine here?

Member

Well, I am preparing a better proposal for this approach. Give me a few hours.

Member Author

@ZeroIntensity ZeroIntensity May 28, 2025

Ok, cool. Feel free to cc me on it.

Something we also need to consider is whether we want to address this for 3.14. Should this general idea be considered a bugfix or a feature?

Member

See: #134819

Member

I think this is an improvement rather than a bug fix.

@vstinner
Member

With this applied, I see performance similar to when lock is a local variable:

I don't fully understand your benchmark. Can you show numbers before/after this change?

@ZeroIntensity
Member Author

I don't fully understand your benchmark. Can you show numbers before/after this change?

Updated the description. The benchmark is just trying to measure the amount of time spent accessing the object. See also the issue.

@corona10
Member

@Fidget-Spinner Out of curiosity, can we just apply deferred reference counting by tracing objects that have a possibility of high contention? Then we don't have to modify objects in this way.

@ZeroIntensity
Member Author

From my understanding, the issue with automatically applying DRC is that it only lets an object be deallocated during a garbage collection, which breaks code that relies on __del__ (or a weakref finalizer) being called when the object's reference count hits zero.

Maybe we could enable it for objects without __del__, and then disable it if weakref.finalize is called on it?
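To make the finalizer-timing concern concrete, here is a minimal sketch of the behavior DRC changes: under CPython's plain reference counting, a weakref.finalize callback runs as soon as the last reference disappears, whereas a deferred object would only die during a GC pass.

```python
# Under CPython's immediate reference counting, the finalizer fires
# the moment the last reference goes away. With deferred reference
# counting, the object would only be reclaimed during a GC pass, so
# code depending on this prompt callback would observe a delay.
import weakref

class Resource:
    pass

events = []
obj = Resource()
weakref.finalize(obj, events.append, "finalized")
del obj        # refcount hits zero -> finalizer runs immediately
print(events)  # ['finalized'] on CPython with immediate refcounting
```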

@Fidget-Spinner
Member

@Fidget-Spinner Out of curiosity, can we just apply deferred reference counting by tracing objects that have a possibility of high contention? Then we don't have to modify objects in this way.

You mean track objects that are frequently shared and apply it to those? It would be quite easy to implement; the only problem is, like Peter said, we can't defer the finalizers of things that take up a huge amount of memory and rely on immediate reclamation.

@corona10
Member

corona10 commented May 27, 2025

If we can apply it conditionally, as Peter said, the behavior change would not be an issue. OTOH, memory consumption would also not be an issue if we provide an option to disable tracing on low-memory devices. Most production machines have enough memory, so it should not be an issue, IIUC. (Please let me know if it will consume huge amounts of memory.)

@corona10
Member

Side note: the reason I am talking about tracing is that I am starting to fear that we will attach the deferred refcount API all over the codebase for performance reasons. It would be chaos for us.

@ZeroIntensity
Member Author

It's generally a one-line change to an object's initializer; why would that be chaos?

We could also shift the decision to the user and add a generic threading.hot, which would enable deferred reference counting (or maybe immortalize non-GC types), but document it as "might do nothing" for forward compatibility. That would act as a safeguard against people constantly asking for deferred reference counting on their favorite objects.
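A rough sketch of what such a forward-compatible helper might look like. Note that `threading.hot` does not exist; the `hot` function below is purely hypothetical, built on the private `sys._defer_refcount` from this PR, which may also be absent.

```python
# Hypothetical sketch -- NOT a real API. It marks an object as highly
# contended, enabling deferred reference counting when the interpreter
# supports it, and is documented as possibly doing nothing.
import sys
import threading

def hot(obj):
    """Best-effort opt-in to deferred reference counting."""
    defer = getattr(sys, "_defer_refcount", None)  # private; may not exist
    if defer is not None:
        try:
            defer(obj)  # per the PR, returns False if DRC can't be applied
        except Exception:
            pass        # defensive: treat any failure as "did nothing"
    return obj          # always usable, even when this was a no-op

lock = hot(threading.Lock())  # works whether or not DRC was applied
print(lock.locked())          # False
```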

@corona10
Member

corona10 commented May 27, 2025

It's generally a one-line change to an object's initializer; why would that be chaos?

No; you are now measuring microbenchmarks and then applying the change case by case. Can you guarantee that the set of cases is limited?
I can bet that we will find more high-contention cases.
What about third-party projects? Should those projects measure high contention and then change their code too?

A better solution would be one where we don't have to care about such cases and the interpreter applies it automatically. I am fine with applying it for this case, but let's discuss a better solution and think about what we can provide.

@mpage
Contributor

mpage commented May 28, 2025

You can work around the scaling issues in the repro pretty easily by passing the lock as an argument to scale:

import threading
import time

lock = threading.Lock()

def scale(lock):
    a = time.perf_counter()
    for _ in range(10000000):
        lock.locked()
    b = time.perf_counter()
    print(b - a, "s")

threads = [threading.Thread(target=scale, args=(lock,)) for _ in range(8)]
for thread in threads:
    thread.start()

I think we should be very cautious about exposing implementation details like deferred reference counting.

@ZeroIntensity
Member Author

I think we should be very cautious about exposing implementation details like deferred reference counting.

FWIW, I don't think we're exposing it, per se. Things will "just work" from the user's perspective. I'd really rather the interpreter do it than force weird implementation details, like using locals.

Member

@vstinner vstinner left a comment

Maybe you can extract Modules/_threadmodule.c changes into a separated PR since these changes don't require adding sys._defer_refcount().

@ZeroIntensity
Member Author

Ok. _thread is still a documented module; should it be private there too?

@colesbury
Contributor

I don't really see the motivation for these particular classes. It seems unlikely to me that someone is checking lock.locked() simultaneously from multiple threads without also acquiring and releasing the lock, which itself will be a bottleneck. Maybe this makes sense for threading.Event()?

I agree with @mpage that we should be cautious about exposing implementation details. The motivation here doesn't seem sufficient to me to overcome that caution.

@ZeroIntensity
Member Author

lock.locked() was just an example to measure the reference count contention and not the internal PyMutex contention. Event is fixed here already too. In practice, people will be doing with lock or whatever, but that will still have the same refcount contention.
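A variant of the earlier benchmark illustrating this point: even the idiomatic `with lock:` pattern touches the lock's reference count on every iteration, so it hits the same contention. (Iteration and thread counts are reduced here so the sketch runs quickly; it is not the PR's benchmark.)

```python
# Even `with lock:` INCREF/DECREFs the shared lock object each pass,
# so a module-global lock sees the same refcount contention as the
# lock.locked() microbenchmark above.
import threading
import time

lock = threading.Lock()

def scale():
    a = time.perf_counter()
    for _ in range(100_000):
        with lock:  # acquire/release still bumps the refcount of `lock`
            pass
    b = time.perf_counter()
    print(b - a, "s")

threads = [threading.Thread(target=scale) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```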

I agree with @mpage that we should be cautious about exposing implementation details. The motivation here doesn't seem sufficient enough to me to overcome that caution.

I don't think we're exposing any actual implementation details, right? The reference count is still hidden to threading users. sys._defer_refcount is a different story, but I'm going to move that anyway.
