bpo-40521: Make tuple free list per-interpreter #20247
Conversation
This change exposes an old reference leak, likely because the empty tuple singleton is now per-interpreter.
Example:
I fixed all leaks by clearing the empty tuple singleton in Py_EndInterpreter().
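For illustration, a minimal sketch of that idea, assuming the empty tuple singleton is cached in free_list[0] of the per-interpreter tuple state (the historical free-list layout); the helper name is hypothetical, not the exact patch:

/* Hedged sketch: drop the per-interpreter empty tuple singleton when an
 * interpreter is torn down, so no reference outlives Py_EndInterpreter(). */
static void
clear_tuple_state(PyInterpreterState *interp)
{
    struct _Py_tuple_state *state = &interp->tuple;
    /* The empty tuple is cached like the other free-list entries
     * (assumed to live in free_list[0]); clearing it fixes the leak. */
    Py_CLEAR(state->free_list[0]);
    state->numfree[0] = 0;
}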
nit:
PyInterpreterState *interp = _PyInterpreterState_GET();
struct _Py_tuple_state *state = &interp->tuple;
This kind of code looks quite redundant. :)
I made it on purpose. My expectation is that in the near future, more and more code will need access to the Python thread state and the interpreter, so I made it more explicit. Also, I tried to reduce the number of times we get the current interpreter, to avoid any performance drawback if this operation has to become slower tomorrow (to support subinterpreters).
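As a hedged sketch of the pattern being discussed (signatures are assumptions based on the PR description, not the exact code): the interpreter is looked up once at the entry point and the tuple state is passed down explicitly, so hot helpers such as tuple_alloc() never repeat the lookup.

/* Sketch only: look up the interpreter once, then pass the state down. */
PyObject *
PyTuple_New(Py_ssize_t size)
{
    PyInterpreterState *interp = _PyInterpreterState_GET();
    struct _Py_tuple_state *state = &interp->tuple;

    PyTupleObject *op = tuple_alloc(state, size);   /* free-list hit or fresh alloc */
    if (op == NULL) {
        return NULL;
    }
    for (Py_ssize_t i = 0; i < size; i++) {
        op->ob_item[i] = NULL;                      /* start with empty slots */
    }
    _PyObject_GC_TRACK(op);                         /* tuples are GC-tracked */
    return (PyObject *)op;
}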
What's the speed impact of the additional indirection and state call?
@tiran: "What's the speed impact of the additional indirection and state call?" I wrote a microbenchmark to measure the worst case: getting the empty tuple singleton and creating a short tuple. The overhead is between +0.7 ns and +2.1 ns per tuple creation:
I ran this micro-benchmark on Python built with LTO, but without PGO. Benchmark:
@serhiy-storchaka @methane @pablogsal: Would you mind reviewing this PR?
Temporary tuple creation time is performance critical. It was the main reason for implementing the fastcall protocol. Some iterators even cache the tuple object to avoid the overhead of tuple creation and destruction. Just remember the effort spent on optimizing
What's your point? Does it mean that you are against this PR? Do you see another approach to make the tuple free lists compatible with sub-interpreters? Or does it mean that you are against the overall project of having one GIL per interpreter? https://bugs.python.org/issue40512 My rationale is that it's fine to make Python 5 to 10% slower overall (once all "per-interpreter" changes are merged) if that allows making Python at least "2x faster" (for workloads which can be distributed across subinterpreters).
I have very bad memories of this micro-optimization. It was a very ugly hack which caused many crashes and we had to fix it 2 or 3 times. If I recall correctly, I removed it because it was a hack and because it caused too many bugs. I wrote the FASTCALL calling convention just to be able to remove this hack :-)
Each interpreter now has its own tuple free lists:
* Move tuple numfree and free_list arrays into PyInterpreterState.
* Define PyTuple_MAXSAVESIZE and PyTuple_MAXFREELIST macros in pycore_interp.h.
* Add _Py_tuple_state structure. Pass it explicitly to tuple_alloc().
* Add tstate parameter to _PyTuple_ClearFreeList().
* Each interpreter now has its own empty tuple singleton.
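For reference, a hedged sketch of what the per-interpreter state described above could look like; the macro values shown are the historical defaults, and the exact definitions live in pycore_interp.h:

#define PyTuple_MAXSAVESIZE 20    /* largest tuple length kept on a free list */
#define PyTuple_MAXFREELIST 2000  /* maximum number of cached tuples per length */

struct _Py_tuple_state {
    /* free_list[n] chains unused tuples of length n (linked through
     * ob_item[0]); numfree[n] tracks how many are currently cached. */
    PyTupleObject *free_list[PyTuple_MAXSAVESIZE];
    int numfree[PyTuple_MAXSAVESIZE];
};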
tl;dr IMO the overhead is not significant. It's so small that it is really hard to measure.
I attached microbench_tuple.py to https://bugs.python.org/issue40521, which requires applying bench_tuple.patch (also attached to the same bpo). It is really hard to measure differences with my change since the differences are around 1 ns, knowing that creating a tuple of 1 item takes around 20 ns. See my previous benchmark ("The overhead is between +0.7 ns and +2.1 ns per tuple creation"): #20247 (comment)
Depending on how Python is compiled, results say "faster" or "slower". My understanding is that the differences are so small (less than 5% faster or slower) that they are not significant. pyperf requires 2^25 loops to get a run longer than 100 ms on my machine.
I ran benchmarks on Fedora 32 with CPU isolation. Results of microbench_tuple.py with CPU isolation, LTO and PGO; that is the most reliable way I know to measure benchmarks.
Creating an empty tuple is "faster": that's likely noise in the benchmark since the numbers are too small for pyperf. Or maybe PGO decided to optimize the code differently and the next build will get different performance...
std dev:
Differences:
Note: building Python with LTO+PGO takes around 10 min, so the overall benchmark takes me around 30 min. Second benchmark run, without PGO (only LTO and CPU isolation), to check the results.
std dev:
Differences:
The benchmark also measures the cost of a function call in C as a baseline.
I rebased the PR on the master branch.
@serhiy-storchaka @methane @pablogsal: So, do you think that the very small performance overhead is acceptable? See my previous comment for all numbers. IMO the overhead is not significant.
I'm OK with this. If we decide to go the other way (e.g. throw away refcounting) in the future, we can revert these changes and share immutables (e.g. None, True, False, (), etc.) anyway.
Great!
Yeah, that's another option for the empty tuple singleton. It's discussed at https://bugs.python.org/issue39511
I ran some benchmarks myself and I see similar results to the ones provided by @vstinner. Given that the differences are nanoseconds here and that the patch is localized and easy to revert in case we notice some unintended consequences, I am also OK with proceeding with this. I also think that even without considering subinterpreters, this can help to reduce the global state, making it easier in the future to improve the embedding experience.
Thank you very much @corona10, @methane and @pablogsal! I wasn't comfortable with touching such a core builtin type of Python. As Serhiy reminded us, tuple performance is critical for Python! More general note about performance: https://speed.python.org/ data should be recreated since the old data were generated with old pyperf and pyperformance versions. Also, the system was upgraded, which makes results harder to analyze :-/ https://speed.python.org/ should be a key tool to track CPython performance!