Instance method performance issue with free threading (3.13rc1) #123619
The problem here is reference count contention: each thread has to acquire the per-object lock to modify the reference count. Unfortunately, we can't enable deferred reference counting for that many objects, because it extends the lifetime of objects, and many programs rely on objects being freed promptly. I've spun up a toy build with an API to enable deferred reference counting from Python code.
And here it is with it enabled for
I don't think we should expose a Python-level API for deferred RC. It makes other implementations worse off and exposes yet another implementation detail to the user. Exposing an unstable C-level API still makes sense.
Yeah, which is why it would make sense to give that tradeoff to the user. Anyway, something like
I've created #123635 to add an API for this. I'm leaving it as a draft for now, though.
For developers unfamiliar with the C API (me), a Python-level API would be much more accessible, perhaps something like a class attribute
I don't think we have a way to denote unstable Python APIs, do we?
Oh! That's good to hear. Adding deferred RC to
In my opinion, we should avoid providing implementation details to the C API as much as possible. cc @vstinner |
That's why my PR puts it in |
Here are the results from 32 cores and 32 threads:
The cost of contention is extremely high. PEP 703 mentions, "A few types of objects, such as top-level functions, code objects, modules, and methods, tend to be frequently accessed by many threads concurrently." However, it seems that while functions have deferred reference counting, methods do not. Python users might not realize this when using free-threaded Python, and if they don't read the C API documentation, they may not know that methods can become a bottleneck at scale and that they need to enable it manually via the C API. PEP 703 also mentions that lists will have a fast path; however, based on my tests, lists are slow, while tuples perform very quickly. A benchmark on global list access:

```python
from threading import Thread
import time

gl = [1]

def getg(i):
    # read from a shared global list
    return gl[0]

def getl(i):
    # build and read a fresh local list
    return [1][0]

def bench_run(runner):
    s = time.monotonic_ns()
    for i in range(500000):
        runner(i)
    d = time.monotonic_ns() - s
    print(f"{d:,}")

def bench_run_parallel(count, runner):
    tl = []
    for i in range(count):
        t = Thread(target=bench_run, args=[runner])
        tl.append(t)
        t.start()
    for t in tl:
        t.join()

if __name__ == '__main__':
    print("no threading global list")
    bench_run(getg)
    print("\nthreading global list")
    bench_run_parallel(6, getg)
    print("\nno threading local list")
    bench_run(getl)
    print("\nthreading local list")
    bench_run_parallel(6, getl)
```
Use tuple as global list(
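The tuple variant isn't shown in full above; a sketch of the swap (the names `gt` and `getgt` are mine) under the same benchmark harness might look like:

```python
from threading import Thread
import time

gt = (1,)  # hypothetical name: same data as gl, stored in a tuple

def getgt(i):
    # same access pattern as getg(), but against a shared tuple
    return gt[0]

def bench_run(runner, n=500_000):
    s = time.monotonic_ns()
    for i in range(n):
        runner(i)
    print(f"{time.monotonic_ns() - s:,}")

# 6 threads reading the shared tuple, mirroring bench_run_parallel above
threads = [Thread(target=bench_run, args=[getgt]) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Per the measurements reported in this thread, this tuple version scaled well across threads while the list version did not.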
Ah, this might be a bug: per PEP 703, deferred reference counting is enabled for the actual function object, but apparently not for the bound-method objects created on attribute access. If we really want to expose a Python API, let's add something unspecific to deferred RC -- something like
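The function/method distinction is visible from Python: each attribute access builds a fresh bound-method object around the underlying function, so deferred RC on the function alone doesn't cover the wrappers. A minimal illustration:

```python
class C:
    def m(self):
        return 1

c = C()

# Each attribute access creates a new bound-method object wrapping C.m
b1 = c.m
b2 = c.m
print(b1 is b2)                    # False: two distinct bound-method objects
print(b1.__func__ is b2.__func__)  # True: both wrap the same function object
```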
@ZeroIntensity I might have found another bug. In my global list benchmark, if I use |
Both are slow on the main branch (main:d8c0fe1). I'm not sure if this is intentional, but if the contention case is expected, then it might not be a bug.
Python 3.14.0a0 experimental free-threading build (heads/main:d8c0fe1, Sep 18 2024, 09:23:28) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
I wouldn't expect any speedup on a build with the GIL enabled. Since there's no I/O here, the GIL is never released to wait for anything, and since only one thread can execute at a time, the result is just the extra overhead introduced by threading.
@ZeroIntensity Thanks for keeping an eye on my issue. From what I understand, the
Slowness on the main branch is intended: we disabled immortalization to introduce deferred refcounting, and deferred refcounting is not yet done, so expect slowness on main.
Sorry, I thought you meant that you did a fresh build of the main branch (with the GIL enabled). I wasn't aware of the DRC problem on main, but yeah, that would explain it.
…ence counting (GH-123635) Co-authored-by: Sam Gross <colesbury@gmail.com>
Now that we have an official API for DRC, I'm going to close this. Thanks, @Yiling-J!
Looks like I'll have to learn some C and CPython internals to make my cache library scalable. Honestly, I still think the best approach would be for CPython to optimize these by default; scalability is also important for free-threaded Python. It would be even better if there were official documentation telling developers what is scalable in free-threaded CPython. @ZeroIntensity
You shouldn't need to learn much C; you can call the function pretty easily with ctypes:

```python
import ctypes

# PyUnstable_Object_EnableDeferredRefcount, exposed via the C API
enable_drc = ctypes.pythonapi.PyUnstable_Object_EnableDeferredRefcount
enable_drc.argtypes = (ctypes.py_object,)
enable_drc.restype = ctypes.c_int

enable_drc(whatever_you_want)
```
@ZeroIntensity I ran another round of benchmark using
I'm not sure what the state of DRC is on main, it might still be disabled/suboptimal.
Note that you're not measuring throughput here, you're measuring time taken. |
That's true, but throughput can easily be derived from the time taken, and if the throughput aligns, the time should as well. My main point is that the issue still persists: the performance doesn't match my expectations.
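For reference, converting the elapsed time printed by the benchmark into throughput is a one-liner (the elapsed figure below is hypothetical):

```python
ops = 500_000                 # iterations per bench_run call
elapsed_ns = 250_000_000      # hypothetical: 0.25 s reported by the benchmark
ops_per_sec = ops / (elapsed_ns / 1_000_000_000)
print(f"{ops_per_sec:,.0f} ops/sec")  # 2,000,000 ops/sec
```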
Sort of, but there's not much actual computing happening here: the interpreter is just passing an object around and messing with its reference count, which will basically always be faster on the main thread because of biased reference counting. As a result, the bottleneck is the reference count operation itself, which isn't really a fair benchmark. This isn't a problem limited to Python; in basically any language with multithreading, you need some sort of atomic operation to use data concurrently, and if you only measure the cost of that against single-threaded code, then of course it will seem slower.
Bug report
Bug description:
Processor: 2.6 GHz 6-Core Intel Core i7
Python 3.13.0rc1 experimental free-threading build (main, Sep 3 2024, 09:59:18) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
When using 6 threads, the instance method becomes approximately 10 times slower, while a plain function shows comparable performance.
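The benchmark script itself isn't included in this excerpt; a minimal reproducer consistent with the description (class and function names are mine) might be:

```python
from threading import Thread
import time

class Worker:
    def get(self, i):
        return i

def get_func(i):
    return i

def bench_run(runner, n=500_000):
    s = time.monotonic_ns()
    for i in range(n):
        runner(i)
    print(f"{time.monotonic_ns() - s:,}")

def bench_run_parallel(count, runner):
    threads = [Thread(target=bench_run, args=[runner]) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    w = Worker()
    print("plain function, 6 threads")
    bench_run_parallel(6, get_func)
    print("\nbound method, 6 threads")
    # All threads share one bound-method object, so its reference
    # count becomes a point of contention on the free-threaded build.
    bench_run_parallel(6, w.get)
```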
Python 3.12.5 (main, Sep 2 2024, 15:11:13) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
CPython versions tested on:
3.13
Operating systems tested on:
macOS
Linked PRs