
Add _PyTuple_New_Nonzeroed #96446


Closed · wants to merge 1 commit

Conversation

@kmod (Contributor) commented Aug 30, 2022

A faster, but unsafe, version of PyTuple_New for use in performance-
critical paths.

Zeroing the contents of a tuple is surprisingly expensive, so for
cases where the contents of the tuple will immediately be overwritten,
time can be saved by calling this non-zeroing version.

This tuple is not safe to be seen by any other code since it will
have garbage where PyObject pointers are expected. If there is any
chance that the initialization code will raise an exception or do
a GC collection, it is not safe to use this function.

perf says that approximately 0.75% of the time in the pyperformance benchmarks
is spent in PyTuple_New, and 0.42% for the Pyston web macrobenchmarks. It
also says that roughly 35% of the time of PyTuple_New is spent in the loop that
zeroes the tuple contents, so if we are able to remove this then we can save
0.26% / 0.15% respectively.

I went through all calls to PyTuple_New and converted them if it was safe to do
so and if it seemed reasonable that the code could be performance-sensitive.
This migrates roughly 90% of the PyTuple_New time to _PyTuple_New_Nonzeroed.

Changes this small are difficult to measure, but looking at the new
macrobenchmarks profile I see that PyTuple_New + _PyTuple_New_Nonzeroed time is
0.26%, down from 0.45%. pyperformance and Pyston-macrobenchmarks performance
both improved by about 0.5% but I don't trust those numbers.
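
For illustration, the intended usage pattern looks roughly like this (a sketch assuming _PyTuple_New_Nonzeroed() takes the same Py_ssize_t size argument as PyTuple_New()):

    /* Sketch only: every slot must be filled immediately, with no fallible
       calls, GC, or arbitrary Python code allowed before the tuple is fully
       initialized. Assumes <Python.h> is included. */
    static PyObject *
    tuple_from_items(PyObject *const *items, Py_ssize_t n)
    {
        PyObject *tup = _PyTuple_New_Nonzeroed(n);
        if (tup == NULL) {
            return NULL;
        }
        for (Py_ssize_t i = 0; i < n; i++) {
            Py_INCREF(items[i]);
            PyTuple_SET_ITEM(tup, i, items[i]);   /* overwrites the garbage slot */
        }
        return tup;
    }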

@vstinner (Member) commented:

Can you please provide benchmarks showing the benefits of this change?

I'm not really excited about adding an unsafe private API for performance. Why not add a fast, safe public API instead?

There is PyTuple_Pack(2, item1, item2) for example. Should we make this function faster or add a new function? Can you try to measure the performance of replacing PyTuple_New() + PyTuple_SET_ITEM() with PyTuple_Pack()?

If we decide to add an unsafe function, I suggest putting Unsafe in the function name: _PyTuple_NewUnsafe(), or maybe Uninitialized. There is already an unsafe _PyGILState_GetInterpreterStateUnsafe() flavor of the safe _PyInterpreterState_Get() function. There is also an unsafe _PyThreadState_UncheckedGet() flavor of the safe _PyThreadState_GET() function.
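
For context, the two call patterns being compared look roughly like this (a sketch assuming <Python.h> is included; the key difference is who owns the item references afterwards):

    /* Pattern A: PyTuple_New() + PyTuple_SET_ITEM(), which steals references,
       so the caller INCREFs the borrowed items it hands over. */
    static PyObject *
    pair_with_new(PyObject *item1, PyObject *item2)   /* borrows item1/item2 */
    {
        PyObject *tup = PyTuple_New(2);
        if (tup == NULL) {
            return NULL;
        }
        Py_INCREF(item1);
        PyTuple_SET_ITEM(tup, 0, item1);
        Py_INCREF(item2);
        PyTuple_SET_ITEM(tup, 1, item2);
        return tup;
    }

    /* Pattern B: PyTuple_Pack() creates new strong references itself. */
    static PyObject *
    pair_with_pack(PyObject *item1, PyObject *item2)  /* borrows item1/item2 */
    {
        return PyTuple_Pack(2, item1, item2);
    }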

Review comment from a Member on this code:

    if (op == NULL) {
        return NULL;
    }
    _PyObject_GC_TRACK(op);

That's bad practice: an uninitialized tuple must not be tracked by the GC; it should only be tracked once all items are initialized. See #59313, where I added _PyType_AllocNoTrack() to attempt to fix a similar issue in a few functions.

PyTuple_Pack() doesn't have this design flaw.

@vstinner (Member) left a review comment:

Maybe an internal _PyTuple_AllocNoTrack() function might make sense: the caller would have to call _PyObject_GC_TRACK() on the tuple once it's initialized.

If PyTuple_Pack() is inefficient, and performance is so critical that the PyTuple_Pack() code handling various tuple lengths is an issue, maybe we should add functions specialized for some tuple lengths? For the internal C API, that's acceptable. Like:

tuple = _PyTuple_Pack1(item1);
tuple = _PyTuple_Pack2(item1, item2);
tuple = _PyTuple_Pack3(item1, item2, item3);
...

A problem with PyTuple_Pack() is that it creates new strong references rather than stealing references. Maybe we need a reference-stealing variant, for the BUILD_TUPLE bytecode for example?
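
A fixed-arity, reference-stealing variant might look roughly like this (hypothetical _PyTuple_Pack2_Steal, sketched only to illustrate the idea; not an existing function):

    /* Hypothetical helper: builds a 2-tuple and steals both references, so it
       is usable where the caller already owns the items (e.g. values popped
       off the evaluation stack). */
    static inline PyObject *
    _PyTuple_Pack2_Steal(PyObject *item1, PyObject *item2)
    {
        PyObject *tup = PyTuple_New(2);
        if (tup == NULL) {
            Py_DECREF(item1);
            Py_DECREF(item2);
            return NULL;
        }
        PyTuple_SET_ITEM(tup, 0, item1);
        PyTuple_SET_ITEM(tup, 1, item2);
        return tup;
    }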

@kmod (Contributor, Author) commented Aug 30, 2022

Hi Victor, could you clarify what benchmark results you would like beyond the benchmarks that I included?

Unfortunately PyTuple_Pack() is very slow due to being a C variadic function, so it's not suitable for performance-sensitive code. I could create a version of that function that takes an array, but my guess is that since the arguments are typically not already in an array this would be a pessimization vs the existing code. For example I'm not sure how to deal with code like this:

        TARGET(BUILD_TUPLE) {
            PyObject *tup = PyTuple_New(oparg);
            if (tup == NULL)
                goto error;
            while (--oparg >= 0) {
                PyObject *item = POP();
                PyTuple_SET_ITEM(tup, oparg, item);
            }
            PUSH(tup);
            DISPATCH();
        }

I'm definitely open to renaming the function. Personally I think that if it's already unsafe to run arbitrary code before this tuple is initialized, adding a second kind of unsafety (GC tracking) doesn't feel any worse to me. But if you feel strongly, I'm happy to make GC tracking the caller's responsibility for this function.

@vstinner (Member) commented:

Hi Victor, could you clarify what benchmark results you would like beyond the benchmarks that I included?

For example, write a C extension (or add a function to the _testcapi extension) which creates a tuple and deletes it in a loop. You can use pyperf with the bench_time_func() method: https://pyperf.readthedocs.io/en/latest/examples.html#bench-time-func-method

Write one function using PyTuple_New(), a second function using _PyTuple_New_Nonzeroed().

Example:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index 2d4c73cfe9..017a29023a 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -5459,6 +5459,32 @@ test_macros(PyObject *self, PyObject *Py_UNUSED(args))
 }
 
 
+static PyObject*
+bench1(PyObject *self, PyObject *args)
+{
+    int i, loops;
+    if (!PyArg_ParseTuple(args, "i", &loops)) {
+        return NULL;
+    }
+
+    PyObject *item1 = Py_None;
+    PyObject *item2 = Py_False;
+
+    _PyTime_t t1 = _PyTime_GetPerfCounter();
+    for (i=0; i < loops; i++) {
+        Py_INCREF(item1);
+        Py_INCREF(item2);
+
+        PyObject *obj = PyTuple_New(2);
+        PyTuple_SET_ITEM(obj, 0, item1);
+    PyTuple_SET_ITEM(obj, 1, item2);
+        Py_DECREF(obj);
+    }
+    _PyTime_t t2 = _PyTime_GetPerfCounter();
+    return PyFloat_FromDouble(_PyTime_AsSecondsDouble(t2 - t1));
+}
+
+
 static PyObject *test_buildvalue_issue38913(PyObject *, PyObject *);
 static PyObject *getargs_s_hash_int(PyObject *, PyObject *, PyObject*);
 static PyObject *getargs_s_hash_int2(PyObject *, PyObject *, PyObject*);
@@ -5734,6 +5760,7 @@ static PyMethodDef TestMethods[] = {
     {"settrace_to_record", settrace_to_record, METH_O, NULL},
     {"test_macros", test_macros, METH_NOARGS, NULL},
     {"clear_managed_dict", clear_managed_dict, METH_O, NULL},
+    {"bench1", bench1, METH_VARARGS, NULL},
     {NULL, NULL} /* sentinel */
 };
 

With bench_new.py:

import pyperf
import _testcapi
runner = pyperf.Runner()
runner.bench_time_func('tuple_new', _testcapi.bench1)

and bench_new_nonzero.py:

import pyperf
import _testcapi
runner = pyperf.Runner()
runner.bench_time_func('tuple_new', _testcapi.bench2)

Compare with:

./python -m venv env
env/bin/python -m pip install pyperf
env/bin/python bench_new.py -o ref.json
env/bin/python bench_new_nonzero.py -o pr.json
env/bin/python -m pyperf compare_to ref.json pr.json

I didn't try all these commands; I'll let you fill in the blanks :-)
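
The bench2 counterpart referenced by bench_new_nonzero.py is not shown above; a sketch of it might look like the following (assuming _PyTuple_New_Nonzeroed() shares PyTuple_New()'s signature, plus a matching {"bench2", bench2, METH_VARARGS, NULL} entry in TestMethods):

    static PyObject*
    bench2(PyObject *self, PyObject *args)
    {
        int i, loops;
        if (!PyArg_ParseTuple(args, "i", &loops)) {
            return NULL;
        }

        PyObject *item1 = Py_None;
        PyObject *item2 = Py_False;

        _PyTime_t t1 = _PyTime_GetPerfCounter();
        for (i=0; i < loops; i++) {
            Py_INCREF(item1);
            Py_INCREF(item2);

            /* Non-zeroing allocation: both slots are filled immediately below. */
            PyObject *obj = _PyTuple_New_Nonzeroed(2);
            PyTuple_SET_ITEM(obj, 0, item1);
            PyTuple_SET_ITEM(obj, 1, item2);
            Py_DECREF(obj);
        }
        _PyTime_t t2 = _PyTime_GetPerfCounter();
        return PyFloat_FromDouble(_PyTime_AsSecondsDouble(t2 - t1));
    }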

@kmod (Contributor, Author) commented Aug 31, 2022

I would say that the equivalent numbers that I provided from a realistic workload are closer to what we actually care about, but nevertheless here are the numbers that you asked for:

$ ./env/bin/python -m pyperf compare_to orig.json nonzeroed.json    
Mean +- std dev: [orig] 25.5 ns +- 0.2 ns -> [nonzeroed] 14.0 ns +- 0.5 ns: 1.81x faster

@mdboom added the performance (Performance or resource usage) label Aug 31, 2022
@vstinner (Member) commented Sep 1, 2022

@serhiy-storchaka @methane @pablogsal @gvanrossum: What do you think of this micro-optimization? Is it worth it?

@vstinner (Member) commented Sep 1, 2022

Mean +- std dev: [orig] 25.5 ns +- 0.2 ns -> [nonzeroed] 14.0 ns +- 0.5 ns: 1.81x faster

Oh, thanks for the micro-benchmark! That's a useful data point, even if the most important one remains the pyperformance results.

@pablogsal (Member) commented Sep 1, 2022

@serhiy-storchaka @methane @pablogsal @gvanrossum: What do you think of this micro-optimization? Is it worth it?

I am fine with it as long as this function remains private. I think the reasoning behind the API is sound and makes sense.

It would be great to double-check the benchmarks, but the numbers sound reasonable.

@tiran (Member) commented Sep 1, 2022

If the NULL loop is a problem, have you tested how memset() performs here?

diff --git a/Objects/tupleobject.c b/Objects/tupleobject.c
index 240af0a9075..c48183339ee 100644
--- a/Objects/tupleobject.c
+++ b/Objects/tupleobject.c
@@ -76,9 +76,7 @@ PyTuple_New(Py_ssize_t size)
     if (op == NULL) {
         return NULL;
     }
-    for (Py_ssize_t i = 0; i < size; i++) {
-        op->ob_item[i] = NULL;
-    }
+    memset(op->ob_item, 0, sizeof(PyObject *) * size);
     _PyObject_GC_TRACK(op);
     return (PyObject *) op;
 }

@mdboom (Contributor) commented Sep 1, 2022

@vstinner wrote:

If PyTuple_Pack() is inefficient and performance is so critical that the PyTuple_Pack() code handling various tuple lengths is an issue, maybe we should add functions specialized for some tuple lengths?
A problem of PyTuple_Pack() is that it creates new strong references rather than stealing references. Maybe we need a variant stealing references for the BUILD_TUPLE bytecode for example?

I think it might be worth seeing what a borrowing version of PyTuple_Pack, for a number of small N, would do for performance. I can volunteer to measure that (for the same set of cases as in this PR).

It still wouldn't address the dynamic N cases, of course. For the BUILD_TUPLE opcode, an array API would work. Probably not the others, though.

@vstinner (Member) commented Sep 1, 2022

It still wouldn't address the dynamic N cases, of course. For the BUILD_TUPLE opcode, an array API would work. Probably not the others, though.

There is already the _PyTuple_FromArraySteal() function for that, which is the same kind of optimization as this PR.
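
For reference, the call shape is array-based (sketch; _PyTuple_FromArraySteal() takes a C array and a length and steals each item reference, which fits BUILD_TUPLE where the values are already contiguous on the evaluation stack):

    /* items points at n PyObject* values the caller owns; on success the new
       tuple takes over those references. (Error-path ownership details omitted
       in this sketch.) */
    PyObject *tup = _PyTuple_FromArraySteal(items, n);
    if (tup == NULL) {
        goto error;
    }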

@markshannon (Member) commented:

An API that creates an incomplete object is a bad API.
Unfortunately we already have PyTuple_New, but let's not make things worse.

We should make _PyTuple_FromArray() public, as it is the only sane API for creating tuples other than PyTuple_Pack().

@kmod (Contributor, Author) commented Sep 1, 2022

Some context here is that we have quite a few of these optimizations that we are upstreaming from Pyston, so although each change is individually ~0.2% and easy to dismiss on its own, in aggregate we hope to get something similar to the multiple-percentage-point improvement we got in 3.8. Unfortunately they probably all will come with an increase in complexity / decrease in elegance.

@tiran I believe the compiler is able to make this transformation, and since it didn't, it presumably decided it wasn't profitable, but I'll go ahead and test it.

I'm still in favor of this change as is, but I'll go back through and see if there is a less-general change that will work for enough of the hotspots. (I thought the BUILD_TUPLE code was putting them in backwards so the FromArraySteal variant wouldn't work, and I'll see about some others.)

@mdboom (Contributor) commented Sep 1, 2022

@kmod wrote:

(I thought the BUILD_TUPLE code was putting them in backwards so the FromArraySteal variant wouldn't work, and I'll see about some others.)

Indeed, you are correct.

I think this PR reveals the shortcoming of _PyTuple_FromArray/_PyTuple_FromArraySteal, which is that in most cases, the elements are not already neatly in an array, so you would need to create an array (perhaps expensively by allocating a new one on the heap, since it's probably unbounded), only to have it copied into the destination tuple. Given that, the create/set/gc-track pattern seems like a reasonable compromise, despite creating an unsafe window.
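
Spelled out, that create/set/gc-track pattern would look roughly like this sketch (using the _PyTuple_AllocNoTrack idea floated above, which is hypothetical, not an existing function):

    /* Sketch: allocate without GC tracking, fill every slot, then hand the
       tuple to the GC. Nothing in the window before _PyObject_GC_TRACK() may
       raise, trigger a collection, or run arbitrary Python code. */
    PyObject *tup = _PyTuple_AllocNoTrack(n);   /* hypothetical allocator */
    if (tup == NULL) {
        goto error;
    }
    for (Py_ssize_t i = 0; i < n; i++) {
        PyTuple_SET_ITEM(tup, i, items[i]);     /* steals references */
    }
    _PyObject_GC_TRACK(tup);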

@gvanrossum (Member) commented:

I’m with Pablo, it is fine as a private API.

Maybe consider a scarier name?

@serhiy-storchaka (Member) commented:

This API is very unsafe. If any error happens before the tuple has been fully initialized, you will decref a tuple which contains garbage. At least one site in ceval.c contains such a bug. If arbitrary Python code is executed before the tuple has been fully initialized, you can see a tuple containing garbage in the list of all objects tracked by the GC. Even if the current code is safe and has no chance of failing or executing Python code before the tuple has been fully initialized, it is very easy to break this condition in future versions.

Many years ago I did some research, and half of the overhead of using a new tuple instead of reusing the same tuple was in tuple deallocation. You cannot consider tuple creation without tuple deallocation, and there are no shortcuts in the latter: you have to loop and decref all items. What is worse, the trashcan mechanism adds additional overhead. 35% of the time of PyTuple_New is a much smaller fraction of the tuple's full life cycle.

Taking into account the large danger of introducing new bugs, such an API can be used only in limited cases, and every use should be surrounded by extensive warning comments. For every particular case we need a microbenchmark which shows that there is a benefit from this optimization. Adding new code can actually slow down other parts of the code, because the CPU needs to cache more code. It also adds a maintenance burden.
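
Roughly, the deallocation side looks like the sketch below (simplified; the real tupledealloc also goes through the free lists and the trashcan macros), which is why the zeroing loop in PyTuple_New is only a fraction of a tuple's full life-cycle cost:

    /* Simplified sketch of tuple deallocation. */
    static void
    tuple_dealloc_sketch(PyTupleObject *op)
    {
        Py_ssize_t n = Py_SIZE(op);
        PyObject_GC_UnTrack(op);
        for (Py_ssize_t i = 0; i < n; i++) {
            Py_XDECREF(op->ob_item[i]);   /* every item must be decref'ed */
        }
        PyObject_GC_Del(op);
    }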

@corona10 (Member) commented Sep 2, 2022

Maybe consider a scarier name?

What about naming it _PyTuple_New_Uninitialized or _PyTuple_Uninitialized?

@methane (Member) commented Sep 2, 2022

I cannot reproduce the 1.8x performance improvement. @kmod Would you share your benchmark code?

I used @vstinner's patch. Here is the patch:

https://gist.github.com/methane/1d5978319a0aa77823ef3d25ea78a177

  • bench1 (PyTuple_New(2))
  • bench2 (_PyTuple_New_Nonzeroed(2))
  • bench3 (_PyTuple_PackSteal2())
$ ./python -m  timeit -s 'from _testcapi import bench1' -- 'bench1(1000000)'
10 loops, best of 5: 29.7 msec per loop

$ ./python -m  timeit -s 'from _testcapi import bench2' -- 'bench2(1000000)'
10 loops, best of 5: 26.9 msec per loop

$ ./python -m  timeit -s 'from _testcapi import bench3' -- 'bench3(1000000)'
10 loops, best of 5: 25.7 msec per loop


$ ./python -m pyperf timeit -s 'from _testcapi import bench1' -- 'bench1(1000000)'
Mean +- std dev: 29.8 ms +- 0.2 ms

$ ./python -m pyperf timeit -s 'from _testcapi import bench2' -- 'bench2(1000000)'
segfault

$ ./python -m pyperf timeit -s 'from _testcapi import bench3' -- 'bench3(1000000)'
Mean +- std dev: 25.8 ms +- 0.2 ms

_PyTuple_New_Nonzeroed() is not surprisingly fast. Even worse, it caused a segfault when I ran bench2 under pyperf.
I have not debugged it yet.

@methane (Member) commented Sep 2, 2022

@tiran It seems gcc uses memset@plt automatically for the current code.
But calling a PLT function is a bit slow when n=2.
That is one reason why _PyTuple_PackSteal2() in my patch is a bit faster than PyTuple_New().

@mdboom (Contributor) commented Sep 2, 2022

Regarding the safety of this proposed API, I have a possibly hare-brained suggestion (I'm new here). What if the tuple were initially filled with some sort of sentinel object when in debug mode, and that object asserted on every possible operation (especially deallocating)? That would make it much more likely to catch any of the errors that @serhiy-storchaka is concerned about when running the unit tests on a debug build.
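
A simplified, hypothetical variant of that idea, using a poison pointer rather than a full sentinel object, might look like this inside the non-zeroing allocator (sketch only; op and size are the allocator's locals):

    #ifdef Py_DEBUG
    /* Hypothetical: poison the uninitialized slots in debug builds so that any
       use of a slot that was never overwritten (including the Py_XDECREF in
       tuple deallocation) crashes deterministically instead of silently
       reading garbage. */
    for (Py_ssize_t i = 0; i < size; i++) {
        op->ob_item[i] = (PyObject *)(uintptr_t)0xDDDDDDDDUL;
    }
    #endif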

Regarding the questions earlier of "why not just use PyTuple_Pack", I have some data to report. I made two branches:

  • PyTuple_Pack uses the existing PyTuple_Pack API where the length of the tuple is statically known (and only for the cases that this PR made changes).
  • _PyTuple_BorrowPackN adds a new API which is like PyTuple_Pack but steals references and doesn't use C variadic parameters. Again, only the cases where the tuple length is statically known were updated.

I also ran this PR on the same benchmarking hardware to make sure we're comparing apples to apples.
I ran these on the pyston macrobenchmarks (all but 3 that don't currently work on Python 3.11) and pyperformance.

  • PyTuple_Pack: 0.2% faster
  • _PyTuple_BorrowPackN: 0.02% slower
  • This PR: 0.3% faster

These diffs are really tiny. For this PR, the improvement is a bit smaller than @kmod's original report of 0.5%, which is probably due to hardware differences, and it's not surprising that they don't exactly match at this level. It's a bit hard to reach any solid conclusions, but I think the following is safe to say:

  • There's no advantage to a new _PyTuple_BorrowPackN API. The overhead of C variadic parameters in PyTuple_Pack doesn't seem to make things measurably worse. The fact that PyTuple_Pack creates new references is mostly a non-issue because the caller can be adjusted to compensate.
  • A hybrid of this PR and the PyTuple_Pack branch might make sense to investigate. That is, use PyTuple_Pack where the size is statically known, and use _PyTuple_New_Nonzeroed API from this PR where it isn't. That would limit the unsafety to fewer call sites. I can volunteer to measure this next, though may not have results until after the long weekend. EDIT: This is 0.7% faster.

kmod added a commit to kmod/cpython that referenced this pull request Sep 2, 2022
PyTuple_New will zero out the tuple before returning to the caller, and a
surprising amount of time can be saved by not doing this zeroing.  One option
is to add a non-zeroing version of PyTuple_New, which I did in python#96446, but
there was resistance to the unsafety of it.

Fortunately it looks like most of the tuple-zeroing happens directly from the
BUILD_TUPLE opcode in the interpreter, which already has the arguments in an
appropriate array, so we can just convert this to _PyTuple_FromArraySteal

This seems to result in a ~0.2% speedup on macrobenchmarks.
@kmod (Contributor, Author) commented Sep 2, 2022

Here's a branch with the benchmark: https://github.com/kmod/cpython/tree/tuple_new_nonzeroed_bench
To run the benchmarks, just run bash bench.sh.

I got the same results on the machine I originally tested on:

+ ./env/bin/python -m pyperf compare_to baseline.json nonzeroed.json
Mean +- std dev: [baseline] 24.1 ns +- 0.1 ns -> [nonzeroed] 13.8 ns +- 0.0 ns: 1.74x faster
+ ./env/bin/python -m pyperf compare_to baseline.json pack.json
Mean +- std dev: [baseline] 24.1 ns +- 0.1 ns -> [pack] 17.1 ns +- 0.0 ns: 1.41x faster

But got pretty different results on my development machine:

+ ./env/bin/python -m pyperf compare_to baseline.json nonzeroed.json
Mean +- std dev: [baseline] 12.8 ns +- 0.2 ns -> [nonzeroed] 10.5 ns +- 2.1 ns: 1.22x faster
+ ./env/bin/python -m pyperf compare_to baseline.json pack.json
Mean +- std dev: [baseline] 12.8 ns +- 0.2 ns -> [pack] 11.0 ns +- 0.1 ns: 1.16x faster

So maybe this is fairly microarchitecture-dependent?

Also, it looks like I was wrong about PyTuple_Pack: the benefit of not zeroing the tuple seems to outweigh the varargs cost.

@kmod (Contributor, Author) commented Sep 2, 2022

I looked at it and essentially all of the benefit of this PR comes from changing the BUILD_TUPLE opcode, with the other locations not really mattering too much, so maybe we can scope this to just that call. When I said "I thought BUILD_TUPLE was putting them in backwards" I should have said "incorrectly thought" -- it looks like it puts them forwards so we can use the FromArraySteal variant.

I put up a new PR (#96516) with only that change. I still don't fully trust the measurements at this level, but I got a speedup of 0.4% on macrobenchmarks.

@mdboom (Contributor) commented Sep 2, 2022

To close the loop on my above comment: using PyTuple_Pack for static-length tuples and the changes in this PR for everything else, I get a 0.7% speedup on the macrobenchmarks on the standard Faster CPython benchmarking machine.

I also kicked off a run of #96516.

@vstinner (Member) commented Sep 8, 2022

What if the tuple were initially filled with some sort of sentinel object when in debug mode, and that object asserted on every possible operation (especially deallocating)?

In debug mode, the Python memory allocators do exactly that: https://docs.python.org/dev/c-api/memory.html#pymem-debug-hooks

Deallocating such a tuple will likely crash immediately, since its pointers are not initialized. See also the _PyObject_IsFreed() and _PyMem_IsPtrFreed() functions.

@markshannon (Member) commented:

@kmod is this obsolete now that we are using _PyTuple_FromArraySteal() in BUILD_TUPLE?

@kmod (Contributor, Author) commented Oct 3, 2022

Yeah, it looks like #96516 captured 80-90% of the benefit of this PR, so I think this isn't worth it anymore.

@kmod closed this Oct 3, 2022