bpo-40521: Make tuple free list per-interpreter #20247

Merged: 3 commits, Jun 4, 2020

Conversation

@vstinner (Member) commented May 19, 2020

Each interpreter now has its own tuple free lists:

  • Move tuple numfree and free_list arrays into PyInterpreterState.
  • Define PyTuple_MAXSAVESIZE and PyTuple_MAXFREELIST macros in
    pycore_interp.h.
  • Add _Py_tuple_state structure. Pass it explicitly to tuple_alloc().
  • Add tstate parameter to _PyTuple_ClearFreeList()
  • Each interpreter now has its own empty tuple singleton.

https://bugs.python.org/issue40521
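For illustration, a minimal sketch of what the per-interpreter state described above might look like; field names follow the bullet list, the exact layout in pycore_interp.h may differ, and the size limits shown are only the historical defaults:

struct _Py_tuple_state {
    /* free_list[n] heads a singly linked list of unused n-tuples; ob_item[0]
       of each cached tuple is reused as the "next" link.  Slot 0 caches the
       interpreter's empty tuple singleton. */
    PyTupleObject *free_list[PyTuple_MAXSAVESIZE];  /* PyTuple_MAXSAVESIZE: e.g. 20 */
    int numfree[PyTuple_MAXSAVESIZE];               /* capped by PyTuple_MAXFREELIST: e.g. 2000 */
};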

@vstinner (Member Author)

This change exposes an old reference leak, likely because the empty tuple singleton is now per-interpreter.

  • test_atexit leaked [3, 3, 3] references, sum=9
  • test__xxsubinterpreters leaked [79, 79, 79] references, sum=237
  • test_capi leaked [2, 2, 2] references, sum=6
  • test_threading leaked [2, 2, 2] references, sum=6

Example:

vstinner@apu$ ./python -m test -R 3:3 test_atexit  -m test.test_atexit.SubinterpreterTest.test_callbacks_leak
(...)
test_atexit leaked [1, 1, 1] references, sum=3
(...)

@vstinner (Member Author)

This change exposes an old reference leak, likely because the empty tuple singleton is now per-interpreter.

I fixed all leaks by clearing the empty tuple singleton in Py_EndInterpreter().
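For illustration, a hedged sketch (not the exact patch) of how that cleanup could look, assuming the per-interpreter state layout sketched above and that slot 0 of the free list caches the empty tuple singleton; the helper name is made up:

/* Called during Py_EndInterpreter() shutdown in this sketch. */
static void
tuple_clear_empty_singleton(PyInterpreterState *interp)
{
    struct _Py_tuple_state *state = &interp->tuple;
    Py_CLEAR(state->free_list[0]);   /* drop the interpreter's () singleton */
    state->numfree[0] = 0;
}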

@corona10 (Member) left a comment

nit:

PyInterpreterState *interp = _PyInterpreterState_GET();
struct _Py_tuple_state *state = &interp->tuple;

This kind of code looks quite redundant. :)

@vstinner (Member Author)

This kind of code looks quite redundant. :)

I made it on purpose. My expectation is that in the near future, more and more code will need access to the Python thread state and the interpreter, so I made it more explicit. See also:

Also, I tried to reduce the number of times that we get the current interpreter, to avoid any performance drawback if this operation becomes slower tomorrow (to support subinterpreters).
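To make the pattern concrete, here is a hedged sketch (simplified signatures, not the exact patch) of how the state is fetched once at the API boundary and then passed down explicitly:

static PyObject *
tuple_alloc(struct _Py_tuple_state *state, Py_ssize_t size);  /* takes the state explicitly */

PyObject *
PyTuple_New(Py_ssize_t size)
{
    PyInterpreterState *interp = _PyInterpreterState_GET();
    struct _Py_tuple_state *state = &interp->tuple;  /* interpreter looked up once... */
    return tuple_alloc(state, size);                 /* ...then passed down explicitly
                                                        (item initialization omitted) */
}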

@tiran (Member) commented May 20, 2020

What's the speed impact of the additional indirection and state call?

@vstinner (Member Author) commented May 20, 2020

@tiran: "What's the speed impact of the additional indirection and state call?"

I wrote a microbenchmark to measure the worst case: getting the empty tuple singleton and creating short tuples. The overhead is between +0.7 ns and +2.1 ns per tuple creation:

vstinner@apu$ python3 -m pyperf compare_to ref.json state.json  --table
+------------------+---------+------------------------------+
| Benchmark        | ref     | state                        |
+==================+=========+==============================+
| tuple()          | 16.1 ns | 17.7 ns: 1.10x slower (+10%) |
+------------------+---------+------------------------------+
| tuple([])        | 22.8 ns | 23.5 ns: 1.03x slower (+3%)  |
+------------------+---------+------------------------------+
| tuple([None])    | 35.9 ns | 37.9 ns: 1.06x slower (+6%)  |
+------------------+---------+------------------------------+
| tuple([1, 2, 3]) | 38.5 ns | 40.6 ns: 1.05x slower (+5%)  |
+------------------+---------+------------------------------+

I ran this micro-benchmark on Python built with LTO, but without PGO.

Benchmark:

import pyperf
runner = pyperf.Runner()

# tuple_new(): PyTuple_New(0): get empty tuple singleton
runner.timeit('tuple()', 'local_tuple()', setup='local_tuple=tuple', duplicate=1024)

# _PyTuple_FromArray([]): get empty tuple singleton
runner.timeit('tuple([])', 'local_tuple(empty_list)', setup='local_tuple=tuple; empty_list=[]', duplicate=1024)

# _PyTuple_FromArray([None]) -> _PyTuple_FromArray([None], 1)
runner.timeit('tuple([None])', 'local_tuple(items)', setup='local_tuple=tuple; items=[None]', duplicate=1024)

# _PyTuple_FromArray([1, 2, 3]) -> _PyTuple_FromArray([1, 2, 3], 3)
runner.timeit('tuple([1, 2, 3])', 'local_tuple(items)', setup='local_tuple=tuple; items=[1, 2, 3]', duplicate=1024)
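For reference, the two JSON files compared above would be produced by running this script once per build and then diffing the results with pyperf (file and interpreter names here are illustrative):

./python-ref bench_tuple_micro.py -o ref.json
./python-state bench_tuple_micro.py -o state.json
python3 -m pyperf compare_to ref.json state.json --table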

@vstinner (Member Author)

@serhiy-storchaka @methane @pablogsal: Would you mind reviewing this PR?

@serhiy-storchaka (Member)

local_tuple() has the overhead of interpreting bytecode and a function call. Could you use some code which creates and destroys tuples in a tight loop in C?

Temporary tuple creation time is performance critical. It was the main reason for implementing the fastcall protocol. Some iterators even cache the tuple object to avoid the overhead of tuple creation and destruction. Just remember the effort spent on optimizing property.
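For illustration, the kind of C-level loop being asked for here could look like the following hedged sketch (a hypothetical extension function, not the bench_tuple.patch attached to the bpo issue later in this thread):

static PyObject *
bench_tuple1(PyObject *self, PyObject *Py_UNUSED(args))
{
    for (long i = 0; i < 1000000; i++) {
        PyObject *t = PyTuple_New(1);     /* allocate, typically from the free list */
        if (t == NULL) {
            return NULL;
        }
        Py_INCREF(Py_None);
        PyTuple_SET_ITEM(t, 0, Py_None);  /* fill the single slot */
        Py_DECREF(t);                     /* destroy: the tuple goes back to the free list */
    }
    Py_RETURN_NONE;
}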

@vstinner (Member Author)

Temporary tuple creation time is performance critical. It was the main reason for implementing the fastcall protocol. Some iterators even cache the tuple object to avoid the overhead of tuple creation and destruction.

What's your point? Does it mean that you are against this PR? Do you see another approach to make the tuple free lists compatible with sub-interpreters?

Or does it mean that you are against the overall project of having one GIL per interpreter? https://bugs.python.org/issue40512

My rationale is that it's fine to make Python 5 to 10% slower overall (once all "per-interpreter" changes are merged), if it allows making Python at least "2x faster" (for workloads which can be distributed across subinterpreters).

Just remember the effort spent on optimizing property.

I have very bad memories of this micro-optimization. It was a very ugly hack which caused many crashes and we had to fix it 2 or 3 times. If I recall correctly, I removed it because it was a hack and because it caused too many bugs.

I wrote the FASTCALL calling convention just to be able to remove this hack :-)

vstinner added 3 commits June 3, 2020 00:29
@vstinner (Member Author) commented Jun 2, 2020

local_tuple() has the overhead of interpreting bytecode and a function call. Could you use some code which creates and destroys tuples in a tight loop in C?

tl;dr: IMO the overhead is not significant. It's so small that it is really hard to measure.

I attached microbench_tuple.py to https://bugs.python.org/issue40521, which requires applying bench_tuple.patch (also attached to the same bpo).

It is really hard to measure differences with my change since the differences are around 1 ns, knowing that creating a tuple of 1 item takes around 20 ns... See my previous benchmark ("The overhead is between +0.7 ns and +2.1 ns per tuple creation"): #20247 (comment)

Depending on how Python is compiled, results say "faster" or "slower". My understanding is that the differences are so small (less than 5% faster or slower) that they are not significant.

pyperf requires 2^25 loops to get a run longer than 100 ms on my machine.

I ran benchmarks on Fedora 32 with CPU isolation.


Results of microbench_tuple.py with CPU isolation, LTO and PGO. This is the most reliable way to run benchmarks that I know of.

+--------------------------------+---------+-----------------------------+
| Benchmark                      | ref     | patch                       |
+================================+=========+=============================+
| ()                             | 4.84 ns | 4.47 ns: 1.08x faster (-8%) |
+--------------------------------+---------+-----------------------------+
| (None,)                        | 18.2 ns | 19.0 ns: 1.04x slower (+4%) |
+--------------------------------+---------+-----------------------------+
| (None, None, None, None, None) | 30.1 ns | 31.3 ns: 1.04x slower (+4%) |
+--------------------------------+---------+-----------------------------+

Creating an empty tuple is "faster": that's likely noise in the benchmark since the numbers are too small for pyperf. Or maybe PGO decided to optimize the code differently, and the next build will get different performance...

std dev:

(): Mean +- std dev: [ref] 4.84 ns +- 0.00 ns -> [patch] 4.47 ns +- 0.01 ns: 1.08x faster (-8%)
(None,): Mean +- std dev: [ref] 18.2 ns +- 0.2 ns -> [patch] 19.0 ns +- 0.2 ns: 1.04x slower (+4%)
(None, None, None, None, None): Mean +- std dev: [ref] 30.1 ns +- 0.1 ns -> [patch] 31.3 ns +- 0.2 ns: 1.04x slower (+4%)

Differences:

  • Empty tuple: -0.4 ns
  • tuple of 1 item: +0.8 ns
  • tuple of 5 items: +1.2 ns

Note: building Python with LTO+PGO takes around 10 min, so the overall benchmark takes me around 30 min.


Second benchmark run, with only LTO and CPU isolation (no PGO), to check the results.

+--------------------------------+---------+------------------------------+
| Benchmark                      | ref-lto | patch-lto                    |
+================================+=========+==============================+
| ()                             | 3.72 ns | 3.96 ns: 1.06x slower (+6%)  |
+--------------------------------+---------+------------------------------+
| (None,)                        | 22.5 ns | 19.8 ns: 1.14x faster (-12%) |
+--------------------------------+---------+------------------------------+
| (None, None, None, None, None) | 33.3 ns | 31.8 ns: 1.05x faster (-5%)  |
+--------------------------------+---------+------------------------------+

std dev:

(): Mean +- std dev: [ref-lto] 3.72 ns +- 0.03 ns -> [patch-lto] 3.96 ns +- 0.11 ns: 1.06x slower (+6%)
(None,): Mean +- std dev: [ref-lto] 22.5 ns +- 0.1 ns -> [patch-lto] 19.8 ns +- 0.1 ns: 1.14x faster (-12%)
(None, None, None, None, None): Mean +- std dev: [ref-lto] 33.3 ns +- 0.1 ns -> [patch-lto] 31.8 ns +- 0.2 ns: 1.05x faster (-5%)

Differences:

  • Empty tuple: +0.2 ns
  • tuple of 1 item: -2.7 ns
  • tuple of 5 items: -1.5 ns

The benchmark () measures PyTuple_New(0), which does almost nothing. In short, it measures:

obj = free_list[0];   /* the cached empty tuple singleton */
Py_INCREF(obj);
Py_DECREF(obj);

and the cost of a function call in C: calling PyTuple_New().


The benchmark (None,) mostly measures PyTuple_New(1), which does:

/* allocation: pop a 1-tuple from the free list */
obj = free_list[1];
numfree[1]--;
_Py_NewReference(obj);
obj->ob_item[0] = NULL;
_PyObject_GC_TRACK(obj);

/* fill the single slot */
Py_INCREF(Py_None);
obj->ob_item[0] = Py_None;

/* deallocation: push the tuple back onto the free list */
PyObject_GC_UnTrack(obj);
Py_XDECREF(obj->ob_item[0]);

obj->ob_item[0] = free_list[1];   /* ob_item[0] is reused as the free-list link */
numfree[1]++;
free_list[1] = obj;

@vstinner (Member Author) commented Jun 2, 2020

I rebased the PR on the master branch.

@vstinner (Member Author) commented Jun 3, 2020

@serhiy-storchaka @methane @pablogsal: So, do you think that the very small performance overhead is acceptable? See my previous comment for all numbers. IMO the overhead is not significant.

@methane (Member) commented Jun 4, 2020

I'm OK with this. If we decide to go the other way (e.g. throw away refcounting) in the future, we can revert these changes and share immutables (e.g. None, True, False, (), etc.) anyway.

@vstinner (Member Author) commented Jun 4, 2020

I'm OK with this.

Great!

If we decide to go the other way (e.g. throw away refcounting) in the future, we can revert these changes and share immutables (e.g. None, True, False, (), etc.) anyway.

Yeah, that's another option for the empty tuple singleton. It's discussed at https://bugs.python.org/issue39511

@pablogsal (Member) left a comment

I ran some benchmarks myself and I see similar results to the ones provided by @vstinner. Given that the differences are nanoseconds here and that the patch is localized and easy to revert in case we notice some unintended consequences, I am also OK with proceeding with this. I also think that, even without considering subinterpreters, this can help reduce global state, making it easier in the future to improve the embedding experience.

@vstinner vstinner merged commit 69ac6e5 into python:master Jun 4, 2020
@vstinner vstinner deleted the tuple_free_list branch June 4, 2020 21:38
@vstinner (Member Author) commented Jun 4, 2020

Thank you very much @corona10, @methane and @pablogsal! I wasn't comfortable touching such a core builtin type of Python. As Serhiy reminded us, tuple performance is critical for Python!

More general note about performance: https://speed.python.org/ data should be recreated, since the old data were generated with older pyperf and pyperformance versions. Also, the system was upgraded, which makes results harder to analyze :-/ https://speed.python.org/ should be a key tool for tracking CPython performance!

arun-mani-j pushed a commit to arun-mani-j/cpython that referenced this pull request Jul 21, 2020