bpo-40521: Make tuple free list per-interpreter #20247

Merged: 3 commits, Jun 4, 2020

Conversation

@vstinner (Member) commented May 19, 2020

Each interpreter now has its own tuple free lists:

  • Move tuple numfree and free_list arrays into PyInterpreterState.
  • Define PyTuple_MAXSAVESIZE and PyTuple_MAXFREELIST macros in
    pycore_interp.h.
  • Add _Py_tuple_state structure. Pass it explicitly to tuple_alloc().
  • Add tstate parameter to _PyTuple_ClearFreeList()
  • Each interpreter now has its own empty tuple singleton.

https://bugs.python.org/issue40521
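For illustration, a minimal sketch of what the per-interpreter state described above might look like; field names follow the bullet list, the exact layout in pycore_interp.h may differ, and the size limits shown are only the historical defaults:

struct _Py_tuple_state {
    /* free_list[n] heads a singly linked list of unused n-tuples; ob_item[0]
       of each cached tuple is reused as the "next" link.  Slot 0 caches the
       interpreter's empty tuple singleton. */
    PyTupleObject *free_list[PyTuple_MAXSAVESIZE];  /* PyTuple_MAXSAVESIZE: e.g. 20 */
    int numfree[PyTuple_MAXSAVESIZE];               /* capped by PyTuple_MAXFREELIST: e.g. 2000 */
};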

@vstinner (Member Author)

This change exposes an old reference leak, likely because the empty tuple singleton is now per-interpreter.

  • test_atexit leaked [3, 3, 3] references, sum=9
  • test__xxsubinterpreters leaked [79, 79, 79] references, sum=237
  • test_capi leaked [2, 2, 2] references, sum=6
  • test_threading leaked [2, 2, 2] references, sum=6

Example:

vstinner@apu$ ./python -m test -R 3:3 test_atexit  -m test.test_atexit.SubinterpreterTest.test_callbacks_leak
(...)
test_atexit leaked [1, 1, 1] references, sum=3
(...)

@vstinner (Member Author)

This change exposes an old reference leak, likely because the empty tuple singleton is now per-interpreter.

I fixed all leaks by clearing the empty tuple singleton in Py_EndInterpreter().
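For illustration, a hedged sketch (not the exact patch) of how that cleanup could look, assuming the per-interpreter state layout sketched above and that slot 0 of the free list caches the empty tuple singleton; the helper name is made up:

/* Called during Py_EndInterpreter() shutdown in this sketch. */
static void
tuple_clear_empty_singleton(PyInterpreterState *interp)
{
    struct _Py_tuple_state *state = &interp->tuple;
    Py_CLEAR(state->free_list[0]);   /* drop the interpreter's () singleton */
    state->numfree[0] = 0;
}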

@corona10 (Member) left a comment

nit:

PyInterpreterState *interp = _PyInterpreterState_GET();
struct _Py_tuple_state *state = &interp->tuple;

This kind of code looks quite redundant. :)

@vstinner (Member Author)

This kind of code looks quite redundant. :)

I made it on purpose. My expectation is that in the near future, more and more code will need access to the Python thread state and the interpreter, so I made it more explicit. See also:

Also, I tried to reduce the number of times that we get the current interpreter, to avoid any performance drawback if this operation becomes slower tomorrow (to support subinterpreters).
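To make the pattern concrete, here is a hedged sketch (simplified signatures, not the exact patch) of how the state is fetched once at the API boundary and then passed down explicitly:

static PyObject *
tuple_alloc(struct _Py_tuple_state *state, Py_ssize_t size);  /* takes the state explicitly */

PyObject *
PyTuple_New(Py_ssize_t size)
{
    PyInterpreterState *interp = _PyInterpreterState_GET();
    struct _Py_tuple_state *state = &interp->tuple;  /* interpreter looked up once... */
    return tuple_alloc(state, size);                 /* ...then passed down explicitly
                                                        (item initialization omitted) */
}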

@tiran (Member) commented May 20, 2020

What's the speed impact of the additional indirection and state call?

@vstinner (Member Author) commented May 20, 2020

@tiran: "What's the speed impact of the additional indirection and state call?"

I wrote a microbenchmark to measure the worst case: getting the empty tuple singleton and creating short tuples. The overhead is between +0.7 ns and +2.1 ns per tuple creation:

vstinner@apu$ python3 -m pyperf compare_to ref.json state.json  --table
+------------------+---------+------------------------------+
| Benchmark        | ref     | state                        |
+==================+=========+==============================+
| tuple()          | 16.1 ns | 17.7 ns: 1.10x slower (+10%) |
+------------------+---------+------------------------------+
| tuple([])        | 22.8 ns | 23.5 ns: 1.03x slower (+3%)  |
+------------------+---------+------------------------------+
| tuple([None])    | 35.9 ns | 37.9 ns: 1.06x slower (+6%)  |
+------------------+---------+------------------------------+
| tuple([1, 2, 3]) | 38.5 ns | 40.6 ns: 1.05x slower (+5%)  |
+------------------+---------+------------------------------+

I ran this micro-benchmark on Python built with LTO, but without PGO.

Benchmark:

import pyperf
runner = pyperf.Runner()

# tuple_new(): PyTuple_New(0): get empty tuple singleton
runner.timeit('tuple()', 'local_tuple()', setup='local_tuple=tuple', duplicate=1024)

# _PyTuple_FromArray([]): get empty tuple singleton
runner.timeit('tuple([])', 'local_tuple(empty_list)', setup='local_tuple=tuple; empty_list=[]', duplicate=1024)

# _PyTuple_FromArray([None]) -> _PyTuple_FromArray([None], 1)
runner.timeit('tuple([None])', 'local_tuple(items)', setup='local_tuple=tuple; items=[None]', duplicate=1024)

# _PyTuple_FromArray([1, 2, 3]) -> _PyTuple_FromArray([1, 2, 3], 3)
runner.timeit('tuple([1, 2, 3])', 'local_tuple(items)', setup='local_tuple=tuple; items=[1, 2, 3]', duplicate=1024)
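For reference, the two JSON files compared above would be produced by running this script once per build and then diffing the results with pyperf (file and interpreter names here are illustrative):

./python-ref bench_tuple_micro.py -o ref.json
./python-state bench_tuple_micro.py -o state.json
python3 -m pyperf compare_to ref.json state.json --table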

@vstinner (Member Author)

@serhiy-storchaka @methane @pablogsal: Would you mind reviewing this PR?

@serhiy-storchaka (Member)

local_tuple() has the overhead of interpreting bytecode and a function call. Could you use some code which creates and destroys tuples in a tight loop in C?

Temporary tuple creation time is performance critical. It was the main reason for implementing the fastcall protocol. Some iterators even cache the tuple object to avoid the overhead of tuple creation and destruction. Just remember the effort spent on optimizing property.
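For illustration, the kind of C-level loop being asked for here could look like the following hedged sketch (a hypothetical extension function, not the bench_tuple.patch attached to the bpo issue later in this thread):

static PyObject *
bench_tuple1(PyObject *self, PyObject *Py_UNUSED(args))
{
    for (long i = 0; i < 1000000; i++) {
        PyObject *t = PyTuple_New(1);     /* allocate, typically from the free list */
        if (t == NULL) {
            return NULL;
        }
        Py_INCREF(Py_None);
        PyTuple_SET_ITEM(t, 0, Py_None);  /* fill the single slot */
        Py_DECREF(t);                     /* destroy: the tuple goes back to the free list */
    }
    Py_RETURN_NONE;
}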

@vstinner (Member Author)

Temporary tuple creation time is performance critical. It was the main reason for implementing the fastcall protocol. Some iterators even cache the tuple object to avoid the overhead of tuple creation and destruction.

What's your point? Does it mean that you are against this PR? Do you see another approach to make the tuple free lists compatible with sub-interpreters?

Or does it mean that you are against the overall project of having one GIL per interpreter? https://bugs.python.org/issue40512

My rationale is that it's fine to make Python 5 to 10% slower overall (once all "per-interpreter" changes are merged), if it allows making Python at least "2x faster" (for workloads which can be distributed across subinterpreters).

Just remember the effort spent on optimizing property.

I have very bad memories of this micro-optimization. It was a very ugly hack which caused many crashes and we had to fix it 2 or 3 times. If I recall correctly, I removed it because it was a hack and because it caused too many bugs.

I wrote the FASTCALL calling convention just to be able to remove this hack :-)

vstinner added 3 commits June 3, 2020 00:29
@vstinner (Member Author) commented Jun 2, 2020

local_tuple() has the overhead of interpreting bytecode and a function call. Could you use some code which creates and destroys tuples in a tight loop in C?

tl;dr: IMO the overhead is not significant. It's so small that it is really hard to measure.

I attached microbench_tuple.py to https://bugs.python.org/issue40521, which requires applying bench_tuple.patch (also attached to the same bpo).

It is really hard to measure differences with my change since the differences are around 1 ns, knowing that creating a tuple of 1 item takes around 20 ns... See my previous benchmark ("The overhead is between +0.7 ns and +2.1 ns per tuple creation"): #20247 (comment)

Depending on how Python is compiled, results say "faster" or "slower". My understanding is that the differences are so small (less than 5% faster or slower) that they are not significant.

pyperf requires 2^25 loops to get a run longer than 100 ms on my machine.

I ran benchmarks on Fedora 32 with CPU isolation.


Results of microbench_tuple.py with CPU isolation, LTO and PGO. This is the most reliable way to run benchmarks that I know of.

+--------------------------------+---------+-----------------------------+
| Benchmark                      | ref     | patch                       |
+================================+=========+=============================+
| ()                             | 4.84 ns | 4.47 ns: 1.08x faster (-8%) |
+--------------------------------+---------+-----------------------------+
| (None,)                        | 18.2 ns | 19.0 ns: 1.04x slower (+4%) |
+--------------------------------+---------+-----------------------------+
| (None, None, None, None, None) | 30.1 ns | 31.3 ns: 1.04x slower (+4%) |
+--------------------------------+---------+-----------------------------+

Creating an empty tuple is "faster": that's likely noise in the benchmark since the numbers are too small for pyperf. Or maybe PGO decided to optimize the code differently, and the next build will get different performance...

std dev:

(): Mean +- std dev: [ref] 4.84 ns +- 0.00 ns -> [patch] 4.47 ns +- 0.01 ns: 1.08x faster (-8%)
(None,): Mean +- std dev: [ref] 18.2 ns +- 0.2 ns -> [patch] 19.0 ns +- 0.2 ns: 1.04x slower (+4%)
(None, None, None, None, None): Mean +- std dev: [ref] 30.1 ns +- 0.1 ns -> [patch] 31.3 ns +- 0.2 ns: 1.04x slower (+4%)

Differences:

  • Empty tuple: -0.4 ns
  • tuple of 1 item: +0.8 ns
  • tuple of 5 items: +1.2 ns

Note: building Python with LTO+PGO takes around 10 min, so the overall benchmark takes me around 30 min.


Second benchmark run, with only LTO and CPU isolation (no PGO), to check the results.

+--------------------------------+---------+------------------------------+
| Benchmark                      | ref-lto | patch-lto                    |
+================================+=========+==============================+
| ()                             | 3.72 ns | 3.96 ns: 1.06x slower (+6%)  |
+--------------------------------+---------+------------------------------+
| (None,)                        | 22.5 ns | 19.8 ns: 1.14x faster (-12%) |
+--------------------------------+---------+------------------------------+
| (None, None, None, None, None) | 33.3 ns | 31.8 ns: 1.05x faster (-5%)  |
+--------------------------------+---------+------------------------------+

std dev:

(): Mean +- std dev: [ref-lto] 3.72 ns +- 0.03 ns -> [patch-lto] 3.96 ns +- 0.11 ns: 1.06x slower (+6%)
(None,): Mean +- std dev: [ref-lto] 22.5 ns +- 0.1 ns -> [patch-lto] 19.8 ns +- 0.1 ns: 1.14x faster (-12%)
(None, None, None, None, None): Mean +- std dev: [ref-lto] 33.3 ns +- 0.1 ns -> [patch-lto] 31.8 ns +- 0.2 ns: 1.05x faster (-5%)

Differences:

  • Empty tuple: +0.2 ns
  • tuple of 1 item: -2.7 ns
  • tuple of 5 items: -1.5 ns

The benchmark () measures PyTuple_New(0), which does almost nothing. In short, it measures:

obj = free_list[0];   /* the cached empty tuple singleton */
Py_INCREF(obj);
Py_DECREF(obj);

and the cost of a function call in C: calling PyTuple_New().


The benchmark (None,) mostly measures PyTuple_New(1), which does:

/* allocation: pop a 1-tuple from the free list */
obj = free_list[1];
numfree[1]--;
_Py_NewReference(obj);
obj->ob_item[0] = NULL;
_PyObject_GC_TRACK(obj);

/* fill the single slot */
Py_INCREF(Py_None);
obj->ob_item[0] = Py_None;

/* deallocation: push the tuple back onto the free list */
PyObject_GC_UnTrack(obj);
Py_XDECREF(obj->ob_item[0]);

obj->ob_item[0] = free_list[1];   /* ob_item[0] is reused as the free-list link */
numfree[1]++;
free_list[1] = obj;

@vstinner (Member Author) commented Jun 2, 2020

I rebased the PR on the master branch.

@vstinner (Member Author) commented Jun 3, 2020

@serhiy-storchaka @methane @pablogsal: So, do you think that the very small performance overhead is acceptable? See my previous comment for all numbers. IMO the overhead is not significant.

@methane (Member) commented Jun 4, 2020

I'm OK with this. If we decide to go the other way (e.g. throw away refcounting) in the future, we can revert these changes and share immutables (e.g. None, True, False, (), etc.) anyway.

@vstinner (Member Author) commented Jun 4, 2020

I'm OK with this.

Great!

If we decide to go the other way (e.g. throw away refcounting) in the future, we can revert these changes and share immutables (e.g. None, True, False, (), etc.) anyway.

Yeah, that's another option for the empty tuple singleton. It's discussed at https://bugs.python.org/issue39511

@pablogsal (Member) left a comment

I ran some benchmarks myself and I see similar results to the ones provided by @vstinner. Given that the differences are nanoseconds here and that the patch is localized and easy to revert in case we notice some unintended consequences, I am also OK with proceeding with this. I also think that, even without considering subinterpreters, this can help reduce global state, making it easier in the future to improve the embedding experience.

@vstinner vstinner merged commit 69ac6e5 into python:master Jun 4, 2020
@vstinner vstinner deleted the tuple_free_list branch June 4, 2020 21:38
@vstinner (Member Author) commented Jun 4, 2020

Thank you very much @corona10, @methane and @pablogsal! I wasn't comfortable touching such a core builtin type of Python. As Serhiy reminded us, tuple performance is critical for Python!

More general note about performance: https://speed.python.org/ data should be recreated, since the old data were generated with older pyperf and pyperformance versions. Also, the system was upgraded, which makes results harder to analyze :-/ https://speed.python.org/ should be a key tool for tracking CPython performance!

arun-mani-j pushed a commit to arun-mani-j/cpython that referenced this pull request Jul 21, 2020