Add _PyTuple_New_Nonzeroed #96446
Conversation
A faster, but unsafe, version of PyTuple_New for use in performance-critical paths.

Zeroing the contents of a tuple is surprisingly expensive, so for cases where the contents of the tuple will immediately be overwritten, time can be saved by calling this non-zeroing version.

This tuple is not safe to be seen by any other code, since it will have garbage where PyObject pointers are expected. If there is any chance that the initialization code will raise an exception or do a GC collection, it is not safe to use this function.

perf says that approximately 0.75% of the time in the pyperformance benchmarks is spent in PyTuple_New, and 0.42% for the Pyston web macrobenchmarks. It also says that roughly 35% of the time of PyTuple_New is spent in the loop that zeroes the tuple contents, so if we are able to remove this then we can save 0.26% / 0.15% respectively.

I went through all calls to PyTuple_New and converted them if it was safe to do so and if it seemed reasonable that the code could be performance-sensitive. This migrates roughly 90% of the PyTuple_New time to _PyTuple_New_Nonzeroed.

Changes this small are difficult to measure, but looking at the new macrobenchmarks profile I see that PyTuple_New + _PyTuple_New_Nonzeroed time is 0.26%, down from 0.45%. pyperformance and Pyston-macrobenchmarks performance both improved by about 0.5%, but I don't trust those numbers.
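To make the contract concrete, here is a minimal sketch of what a converted call site might look like (the `items` array and `n` are hypothetical stand-ins; the pattern just follows the constraints stated above):

```c
/* Sketch: fill every slot immediately, with no intervening code that
   can raise an exception or trigger a GC collection. */
PyObject *tup = _PyTuple_New_Nonzeroed(n);
if (tup == NULL) {
    return NULL;
}
for (Py_ssize_t i = 0; i < n; i++) {
    Py_INCREF(items[i]);
    PyTuple_SET_ITEM(tup, i, items[i]);  /* overwrite the garbage slot */
}
return tup;
```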
Can you please provide benchmarks showing the benefits of this change? I'm not really excited about adding an unsafe private API for performance. Why not add a fast, safe public API instead? There is …

If we decide to add an unsafe function, I suggest putting …
```c
if (op == NULL) {
    return NULL;
}
_PyObject_GC_TRACK(op);
```
That's bad practice: uninitialized tuples must not be tracked by the GC; a tuple should only be tracked once all of its items are initialized. See #59313, where I added _PyType_AllocNoTrack() to attempt to fix a similar issue in a few functions.

PyTuple_Pack() doesn't have this design flaw.
An internal _PyTuple_AllocNoTrack() function might make sense: the caller would have to call _PyObject_GC_TRACK() on the tuple once it's initialized.
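A rough sketch of the pattern this would enable, assuming a hypothetical _PyTuple_AllocNoTrack() with the described contract:

```c
/* Allocate without GC tracking (hypothetical function). */
PyObject *tup = _PyTuple_AllocNoTrack(2);
if (tup == NULL) {
    return NULL;
}
PyTuple_SET_ITEM(tup, 0, Py_NewRef(item1));
PyTuple_SET_ITEM(tup, 1, Py_NewRef(item2));
/* Only now, with all slots holding valid references, does the tuple
   become visible to the GC. */
_PyObject_GC_TRACK(tup);
```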
If PyTuple_Pack() is inefficient, and performance is so critical that PyTuple_Pack()'s code for handling various tuple lengths is an issue, maybe we should add functions specialized for specific tuple lengths? For the internal C API, it's acceptable. Like:
```c
tuple = _PyTuple_Pack1(item1);
tuple = _PyTuple_Pack2(item1, item2);
tuple = _PyTuple_Pack3(item1, item2, item3);
...
```
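A sketch of what one such fixed-arity helper might look like (hypothetical implementation; like PyTuple_Pack(), it creates new strong references, but it avoids the varargs machinery):

```c
static inline PyObject *
_PyTuple_Pack2(PyObject *item1, PyObject *item2)
{
    PyObject *tup = PyTuple_New(2);
    if (tup == NULL) {
        return NULL;
    }
    PyTuple_SET_ITEM(tup, 0, Py_NewRef(item1));
    PyTuple_SET_ITEM(tup, 1, Py_NewRef(item2));
    return tup;
}
```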
A problem of PyTuple_Pack() is that it creates new strong references rather than stealing references. Maybe we need a variant that steals references, for the BUILD_TUPLE bytecode for example?
Hi Victor, could you clarify what benchmark results you would like beyond the benchmarks that I included? Unfortunately …

For reference, the current BUILD_TUPLE implementation is:

```c
TARGET(BUILD_TUPLE) {
    PyObject *tup = PyTuple_New(oparg);
    if (tup == NULL)
        goto error;
    while (--oparg >= 0) {
        PyObject *item = POP();
        PyTuple_SET_ITEM(tup, oparg, item);
    }
    PUSH(tup);
    DISPATCH();
}
```

I'm definitely open to renaming the function. Personally I think that if it's already unsafe to run arbitrary code before this tuple is initialized, adding a second kind of unsafety (GC tracking) doesn't feel any worse to me. But if you feel strongly, I'm happy to move GC tracking to be the caller's responsibility for this function.
For example, write a C extension (or add a function to the _testcapi extension) which creates a tuple and deletes it in a loop. You can use pyperf with the bench_time_func() method: https://pyperf.readthedocs.io/en/latest/examples.html#bench-time-func-method

Write one function using PyTuple_New() and a second function using _PyTuple_New_Nonzeroed(). Example:

```diff
diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index 2d4c73cfe9..017a29023a 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -5459,6 +5459,32 @@ test_macros(PyObject *self, PyObject *Py_UNUSED(args))
 }
 
+static PyObject*
+bench1(PyObject *self, PyObject *args)
+{
+    int i, loops;
+    if (!PyArg_ParseTuple(args, "i", &loops)) {
+        return NULL;
+    }
+
+    PyObject *item1 = Py_None;
+    PyObject *item2 = Py_False;
+
+    _PyTime_t t1 = _PyTime_GetPerfCounter();
+    for (i=0; i < loops; i++) {
+        Py_INCREF(item1);
+        Py_INCREF(item2);
+
+        PyObject *obj = PyTuple_New(2);
+        PyTuple_SET_ITEM(obj, 0, item1);
+        PyTuple_SET_ITEM(obj, 1, item2);
+        Py_DECREF(obj);
+    }
+    _PyTime_t t2 = _PyTime_GetPerfCounter();
+    return PyFloat_FromDouble(_PyTime_AsSecondsDouble(t2 - t1));
+}
+
+
 static PyObject *test_buildvalue_issue38913(PyObject *, PyObject *);
 static PyObject *getargs_s_hash_int(PyObject *, PyObject *, PyObject*);
 static PyObject *getargs_s_hash_int2(PyObject *, PyObject *, PyObject*);
@@ -5734,6 +5760,7 @@ static PyMethodDef TestMethods[] = {
     {"settrace_to_record", settrace_to_record, METH_O, NULL},
     {"test_macros", test_macros, METH_NOARGS, NULL},
     {"clear_managed_dict", clear_managed_dict, METH_O, NULL},
+    {"bench1", bench1, METH_VARARGS, NULL},
     {NULL, NULL} /* sentinel */
 };
```
With bench_new.py: …

and bench_new_nonzero.py: …

Compare with: …
I didn't try all these commands, I let you fill the blanks :-)
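Since the scripts themselves weren't captured above, here is a minimal sketch of what bench_new.py could look like, assuming the bench1() helper from the diff (bench_new_nonzero.py would be identical, but calling a hypothetical bench2() built on _PyTuple_New_Nonzeroed()):

```python
# bench_new.py (sketch): pyperf calls the benchmark function with a loop
# count and expects it to return the elapsed time in seconds, which is
# exactly what _testcapi.bench1() does.
import pyperf
import _testcapi

runner = pyperf.Runner()
runner.bench_time_func('tuple_new', _testcapi.bench1)
```

The two result files could then be compared with something like `python -m pyperf compare_to bench_new.json bench_new_nonzero.json`.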
I would say that the equivalent numbers that I provided from a realistic workload are closer to what we actually care about, but nevertheless here are the numbers that you asked for:
…
@serhiy-storchaka @methane @pablogsal @gvanrossum: What do you think of this micro-optimization? Is it worth it?
Oh, thanks for the micro-benchmark! That's a useful data point, even if the most important one remains the pyperformance results.
I am fine with it as long as this function remains private. I think the reasoning behind the API is sound and makes sense. It would be great to double-check the benchmarks, but the numbers sound reasonable.
If the NULL loop is a problem, have you tested how memset() performs here?
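For context, the zeroing in PyTuple_New() is a plain store loop over the item slots; the memset() alternative would look roughly like this (a sketch modeled on Objects/tupleobject.c, where `op` is the new tuple and `size` its length):

```c
/* Current approach: store NULL into each slot, one at a time. */
for (Py_ssize_t i = 0; i < size; i++) {
    op->ob_item[i] = NULL;
}

/* memset() alternative: relies on NULL being all-zero bits, an
   assumption CPython already makes on its supported platforms. */
memset(op->ob_item, 0, size * sizeof(PyObject *));
```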
@vstinner wrote: …

I think it might be worth seeing what the performance of a borrowing version of PyTuple_Pack() would be. It still wouldn't address the dynamic-N cases, of course. For the …
There is already the _PyTuple_FromArraySteal() function for that, which is the same kind of optimization as this PR.
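A sketch of how BUILD_TUPLE could use it (an illustration, not the actual patch; stack_pointer, STACK_SHRINK, and PUSH follow the interpreter's conventions):

```c
TARGET(BUILD_TUPLE) {
    /* The oparg items are already contiguous on the value stack, in
       order; _PyTuple_FromArraySteal() consumes (steals) those
       references, so no per-item zeroing or re-filling is needed. */
    PyObject *tup = _PyTuple_FromArraySteal(stack_pointer - oparg, oparg);
    if (tup == NULL)
        goto error;
    STACK_SHRINK(oparg);
    PUSH(tup);
    DISPATCH();
}
```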
An API that creates an incomplete object is a bad API. We should make …
Some context here is that we have quite a few of these optimizations that we are upstreaming from Pyston, so although each change is individually ~0.2% and easy to dismiss on its own, in aggregate we hope to get something similar to the multiple-percentage-point improvement we got in 3.8. Unfortunately they will probably all come with an increase in complexity / decrease in elegance.

@tiran I believe the compiler is able to make this transformation, and since it didn't, it presumably decided it wasn't profitable; but I'll go ahead and test it.

I'm still in favor of this change as is, but I'll go back through and see if there is a less general change that will work for enough of the hotspots. (I thought the BUILD_TUPLE code was putting the items in backwards, so the FromArraySteal variant wouldn't work; I'll see about some others.)
@kmod wrote: …

Indeed, you are correct. I think this PR reveals the shortcoming of …
I’m with Pablo, it is fine as a private API. Maybe consider a scarier name?
This API is very unsafe. If any error happens before the tuple has been fully initialized, you will decref a tuple which contains garbage. At least one site in …

Many years ago I did some research, and half of the overhead of using a new tuple instead of reusing the same tuple was in the tuple deallocation. You cannot consider tuple creation without tuple deallocation, and there are no shortcuts in the latter: you have to loop and decref all items. What is worse, the trashcan mechanism adds additional overhead. 35% of the time of PyTuple_New is a much smaller fraction of the full life cycle of the loop.

Taking into account the large danger of introducing new bugs, such an API can be used only in limited cases. And every use should be surrounded by extensive warning comments. And for every particular case we need a microbenchmark which shows that there is a benefit from this optimization. Actually, adding new code can slow down other parts of the code, because the CPU needs to cache more code now. It also adds a maintenance burden.
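For reference, the deallocation work being described is roughly the following (simplified from tupledealloc() in Objects/tupleobject.c, with the trashcan machinery omitted):

```c
/* Every slot must be visited and decref'ed; there is no shortcut. */
Py_ssize_t i = Py_SIZE(op);
while (--i >= 0) {
    Py_XDECREF(op->ob_item[i]);  /* Py_XDECREF tolerates NULL slots */
}
```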
What about naming it to …?
I cannot reproduce the 1.8x performance improvement. @kmod Would you share your benchmark code? I used @vstinner's patch. Here is the patch: https://gist.github.com/methane/1d5978319a0aa77823ef3d25ea78a177
@tiran It seems gcc uses …
Regarding the safety of this proposed API, I have a possibly hare-brained suggestion (I'm new here). What if the tuple were initially filled with some sort of sentinel object when in debug mode, and that object asserted on every possible operation (especially deallocating)? That would make it much more likely to catch any of the errors that @serhiy-storchaka is concerned about when running the unit tests on a debug build.

Regarding the earlier question of "why not just use PyTuple_Pack?", I have some data to report. I made two branches: …

I also ran this PR on the same benchmarking hardware to make sure we're comparing apples to apples: …

These diffs are really tiny. For this PR, the improvement is a bit smaller than @kmod's original report of 0.5%, which is probably due to hardware differences, and it's not surprising that they don't exactly match at this level. It's a bit hard to reach any solid conclusions, but I think the following is safe to say: …
PyTuple_New will zero out the tuple before returning to the caller, and a surprising amount of time can be saved by not doing this zeroing. One option is to add a non-zeroing version of PyTuple_New, which I did in python#96446, but there was resistance to the unsafety of it. Fortunately it looks like most of the tuple-zeroing happens directly from the BUILD_TUPLE opcode in the interpreter, which already has the arguments in an appropriate array, so we can just convert this to _PyTuple_FromArraySteal. This seems to result in a ~0.2% speedup on macrobenchmarks.
Here's a branch with the benchmark: https://github.com/kmod/cpython/tree/tuple_new_nonzeroed_bench

I got the same results on the machine I originally tested on: …

But got pretty different results on my development machine: …

So maybe this is fairly microarchitecture-dependent? Also, it looks like I was wrong about PyTuple_Pack: the benefits of not zeroing it out seem to outweigh the varargs cost.
I looked at it, and essentially all of the benefit of this PR comes from changing the BUILD_TUPLE opcode, with the other locations not really mattering too much, so maybe we can scope this to just that call.

When I said "I thought BUILD_TUPLE was putting them in backwards" I should have said "incorrectly thought" -- it looks like it puts them in forwards, so we can use the FromArraySteal variant. I put up a new PR (#96516) with only that change. I still don't fully trust the measurements at this level, but I got a speedup of 0.4% on macrobenchmarks.
To close the loop on my above comment: using …. I also kicked off a run of #96516.
In debug mode, the Python memory allocators do exactly that: https://docs.python.org/dev/c-api/memory.html#pymem-debug-hooks

Deallocating such a tuple will likely crash immediately, since the pointers are not initialized. See also the _PyObject_IsFreed() and _PyMem_IsPtrFreed() functions.
@kmod is this obsolete now that we are using _PyTuple_FromArraySteal?
Yeah, it looks like #96516 was 80-90% of the benefit of this PR, so I think this isn't worth it anymore.