Skip to content

MAINT: Improve speed of ufunc kwargs parsing #11351

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 2, 2018

Conversation

mhvk
Copy link
Contributor

@mhvk mhvk commented Jun 16, 2018

This is an alternative to #11333, which doesn't use the enum and includes a drastic rewrite in terms of conversion functions instead of the enormous switch/case loop (except for "out"). The reductions in overhead are the same as with #11333 to within a few ns (see below).

@eric-wieser - what do you think? My own sense is that this will be much more easily used also for reduce, etc., as it follows the python parsing ideas more closely.

EDIT: added reductions in overhead from #11333, since I closed that one.

import numpy as np
a = np.array(1)
b = np.empty_like(a)
%timeit np.positive(a)                                 # 352->348 ns
%timeit np.positive(a, subok=True)                     # 606->501 ns
%timeit np.positive(a, where=True)                     # 586->503 ns
%timeit np.positive(a, where=True, subok=True)         # 695->531 ns
%timeit np.positive(a, b)                              # 354->352 ns
%timeit np.positive(a, out=b)                          # 557->480 ns
%timeit np.positive(a, out=b, subok=True)              # 668->506 ns
%timeit np.positive(a, out=b, where=True, subok=True)  # 752->536 ns

@mhvk mhvk requested a review from eric-wieser June 16, 2018 00:29
@mhvk mhvk changed the title MAINT: Improve speed of ufunc kwargs parsing. MAINT: Improve speed of ufunc kwargs parsing (alternative). Jun 16, 2018
(++kwarg)->output = out_subok;
(++kwarg)->output = out_typetup;
(++kwarg)->output = &sig;
(++kwarg)->output = out_extobj;
Copy link
Member

@eric-wieser eric-wieser Jun 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block isn't threadsafe (since ufunc_kwargs is static)

How much does performance change if you make it non-static?

Really though, I'm not sure this gains anything over a parallel array of outputs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I had not thought of that... I actually had a parallel array before, and at the time it seemed less nice. But reverting, I actually think it is much nicer, so I adopted it.

{NULL, NULL, NULL} /* sentinel */
};
ufunc_kwarg *kwarg = ufunc_kwargs;
if (kwarg->str == NULL) { /* fill on first call */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth adding that this is because the strings are not interned yet during static initialization.

You could store &npy_um_str_out etc into the array instead, which would work fine during static initialization

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, better indeed.


PyArray_Descr *dtype = NULL;
PyObject *sig = NULL;
typedef int (*converter)(PyObject *, void *);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I often like to do my function typedefs as

typedef int converter(PyObject *, void *)

And then use converter *convert below, which reads as "a pointer to a converter function"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had mine from the CPython implementation, but that definitely looks nicer, so I changed it.

@eric-wieser
Copy link
Member

I wonder if you should just make a va_list function, which you could call as

parse_kwargs_fast(kwargs,
    interned_name, parser, target_var,
    interned_name, parser, target_var,
    ...
    NULL)

I know very little about va_list, but my understanding is that it's really just a bunch of helpers to walk through the stack

@mhvk
Copy link
Contributor Author

mhvk commented Jun 16, 2018

@eric-wieser - I followed your suggestions and think it looks quite a bit better still. Do you agree that this is a better solution than #11333? If so, I'll close that one.

On the va_list, I did think of making effectively a sped-up version of PyArg_ParseArgsAndKeywords, but that may be better done in a separate PR. Also, I think for this routine, we may want to do out before it even gets called (since it is needed even if the ufunc is overridden).

@mhvk mhvk force-pushed the ufunc-parsing-speedup-alternative branch from f34960a to d41caa9 Compare June 16, 2018 17:17
@mattip
Copy link
Member

mattip commented Jun 19, 2018

I see #113333 is still open, but this is the desired alternative? Can we merge this and close the other?

@mhvk
Copy link
Contributor Author

mhvk commented Jun 19, 2018

I definitely think this is the better solution -- just wasn't sure whether there were further comments.

@mhvk mhvk force-pushed the ufunc-parsing-speedup-alternative branch from d41caa9 to 82e02c6 Compare June 21, 2018 14:39
str_key_obj = PyUnicode_AsASCIIString(key);
if (str_key_obj != NULL) {
key = str_key_obj;
while (PyDict_Next(kwds, &pos, &key, &value)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could almost become a fast keyword parsing function that we could use also for the reductions. But as is, this part of code is not able to DECREF items on failure, so I'd rather leave that for another PR (it needs changing the converters such that if passed NULL, they decref if needed).


typedef int converter(PyObject *, void *);
typedef struct {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not completely sold on this new struct - its main advantage of keeping the names and converters together perhaps doesn't outweigh adding it -- but for use in just one function it is probably fine (and another reason for not yet aiming to make a general keyword parsing function).

@mhvk
Copy link
Contributor Author

mhvk commented Jun 21, 2018

Rebased, and refactored a little more. There are some obvious further improvements possible, but I'd prefer to leave those for a future PR (see in-line comment).

@mhvk mhvk force-pushed the ufunc-parsing-speedup-alternative branch from 82e02c6 to 62a842f Compare June 21, 2018 17:51
@mhvk mhvk changed the title MAINT: Improve speed of ufunc kwargs parsing (alternative). MAINT: Improve speed of ufunc kwargs parsing Jun 21, 2018
@mhvk
Copy link
Contributor Author

mhvk commented Jun 27, 2018

@eric-wieser - sorry to bug for two PRs, but this one would also be nice to get in...

@eric-wieser eric-wieser self-assigned this Jun 27, 2018
{NULL, NULL} /* sentinel */
};
/*
* Some converters are from the Array_API, which are not availabe'
Copy link
Member

@eric-wieser eric-wieser Jun 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't be true once umath and multiarray are merged, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

outputs[9] = out_typetup;
outputs[10] = &sig;
outputs[11] = out_extobj;
outputs[12] = NULL; /* sentinel for key-not-found */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this order duplicated between here and above seems very fragile. I'll see if I can think of another way

output = outputs[kwarg - ufunc_kwargs];
if (output) {
if (kwarg->convert) {
if (!(*(kwarg->convert))(value, output)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: No need to dereference the function pointer here - *fp() and fp() are synonymous in C. I'm fine with this line as is though

bad_arg = 0;
}
else {
/* Just copy reference */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is special casing NULL here really better than just having one more function pointer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

else {
#if PY_VERSION_HEX >= 0x03000000
PyErr_Format(PyExc_TypeError,
"'%S' is an invalid keyword to ufunc '%s'",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, CPython3x format strings have a special syntax for unicode? I think there's a lot of places that cleanup could go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'S' is equivalent to str, and, yes, I was very happy when I found it (and then sad that it didn't work on python 2... fortunately, not much longer.)

* And we do indeed intern them.
*/
void *output;
ufunc_kwarg *kwarg = ufunc_kwargs;
Copy link
Member

@eric-wieser eric-wieser Jun 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you extract this line to line 787 as a helper function, ufunc_kwarg *find_ufunc_kwd(PyObject *key), which returns null if nothing is found?

That would make it easier to swap it for a pointer-ordered binary search in future, if necessary.

You'd have to put the struct and static array definition at global scope, but I think that's fine.

@@ -11,12 +11,22 @@ NPY_NO_EXPORT const char*
ufunc_get_name_cstr(PyUFuncObject *ufunc);

/* interned strings (on umath import) */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add from umathmodule.c to this comment?

@mhvk mhvk force-pushed the ufunc-parsing-speedup-alternative branch from 62a842f to b292dcb Compare June 28, 2018 19:41
return -1;
}
if (kwnames[index]) {
va_start(va, kwnames);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure this way of repeatedly going through the va_list is OK generically, but I guess the tests will tell.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you're fine:

void va_start (va_list ap, paramN);
If ap has already been passed as first argument to a previous call to va_start or va_copy, it shall be passed to va_end before calling this function.

if (pos != 0) {
Py_ssize_t repeat = 0;
while (PyDict_Next(kwds, &repeat, &key, &value) && repeat != pos) {
int i;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously, some repetition here with the above. Not quite sure how to solve it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could encapsulate the loop::

for (i = 0; i <= index; i++) {
    convert = va_arg(va, converter *);
    output = va_arg(va, void *);
}

But arguably not worth it.

@mattip
Copy link
Member

mattip commented Jun 28, 2018

Maybe I missed it, but did you try replacing this with PyArg_ParseArgsAndKeywords ? It seems you are moving in the direction of duplicating much of that code

@mhvk
Copy link
Contributor Author

mhvk commented Jun 28, 2018

@mattip - the problem is that PyArg_ParseArgsAndKeywords is very slow, in part because it turns every character keyword into an object, in part (IIRC) because it searches for keywords in the dict instead of scanning over the dict. But, yes, effectively we're emulating CPython code (the parts that deal with function calls); sadly, that is not exposed.

@eric-wieser
Copy link
Member

@mhvk: Do we have a benchmark that proves that?

@mhvk mhvk force-pushed the ufunc-parsing-speedup-alternative branch from 6ed3bfc to 6d42d80 Compare June 28, 2018 20:44
@mhvk
Copy link
Contributor Author

mhvk commented Jun 28, 2018

@eric-wieser, @mattip - one can check with the reduction methods (which do use PyArg_ParseArgsAndKeywords):

In [17]: a = np.arange(10.)

In [18]: %timeit np.add.reduce(a, 0)
The slowest run took 24.99 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.11 µs per loop

In [19]: %timeit np.add.reduce(a, axis=0)
The slowest run took 19.99 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.36 µs per loop

@eric-wieser
Copy link
Member

eric-wieser commented Jun 28, 2018

Can you add (in a new PR) a benchmark that compares:

np.add(a, b, out=c)
np.add(a, b, c)
np.add(a, b, out=(c,))

for some small arrays? Preferably in some way that asv will show them all on the same graph

@mhvk mhvk force-pushed the ufunc-parsing-speedup-alternative branch from 6d42d80 to fc5f537 Compare June 28, 2018 23:23
@mhvk
Copy link
Contributor Author

mhvk commented Jun 28, 2018

I pushed a version that has the benchmarks as the first commit. I tried installing asv and it did run with the simple ./runtest.py --bench command, but nothing else in bench/README.rst just works (no comparisons, no publishing). I don't really have time to debug this, though, and would prefer to focus on whether the PR itself makes sense (I feel it is an improvement over the previous parsing even ignoring the speed gain!).

@@ -148,3 +148,62 @@ def time_add_scalar_conv(self):

def time_add_scalar_conv_complex(self):
(self.y + self.z)


class ArgParsing(Benchmark):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like these in a separate PR, simply because the hosted asv runs nightly, so would not produce a before / after if they get merged together.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, see #11453

}
va_end(va);
if (output) {
convert(NULL, output);
Copy link
Member

@eric-wieser eric-wieser Jun 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll get a system error here if convert tries to do any python stuff since there's an exception in flight. Possibly worth Fetching the exception, and checking the error code here, just in case.

* Note that if any of the out_* is NULL, parse_ufunc_keywords
* will treat the keyword as unwanted.
*/
if (parse_ufunc_keywords(
Copy link
Member

@eric-wieser eric-wieser Jun 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move the definition of kwnames to just above here, so that the two things that have to be in the same order are at least adjacent?

static variables can be declared at any scope, not just function scope, IIRC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes much more sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I rewrote as

static *kwnames[13] = {NULL}
if (kwnames[0] == NULL) {
    kwnames[0] = npy_um_...
...
}

so that I don't have the triple ***kwnames any more. Should be just that little bit faster.

if (_convert_casting == NULL) {
_convert_casting = (converter *)PyArray_CastingConverter;
_convert_order = (converter *)PyArray_OrderConverter;
_convert_dtype = (converter *)PyArray_DescrConverter2;
Copy link
Member

@eric-wieser eric-wieser Jun 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really worth it vs just passing these functions directly into the parse_ufunc_keywords call below? You're just making copies of the pointer here, which at best might provide marginally improved cache locality

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duh, yes, of course that should work! Indeed, much clearer, especially with the keywords there too.

_convert_dtype = (converter *)PyArray_DescrConverter2;
};
/*
* Initialize output objects so caller knows when outputs and optional
* arguments are set (also means we can safely XDECREF on failure).
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary now, or does parse_ufunc_keywords handle that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, still necessary. The parser only touches parts that are actually in kwds

return NPY_FAIL;
}
}
else if (obj == NULL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is never hit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is if the parser is calling the function again in the failure path to undo an earlier initialization, i.e., in this case when, say, the first keyword was where and the second keyword something that is not allowed.

This behaviour is the same as for PyArg_ParseArgsAndKeywords

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.s. Added a note about the converters to make this clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But looking more closely at the array ones, I'm not so sure they are actually written to follow the logic... And indeed, reading the documentation [1] more carefully, a function has to return Py_CLEANUP_SUPPORTED and this only works in python 3. I'll rewrite accordingly.

[1] https://docs.python.org/3/c-api/arg.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant here is that you had your ifs the wrong way around.

if (obj != PyTrue) {
}
else if (obj == NULL) {
    // but this can never happen, because PyTrue != NULL
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, that was silly!

NPY_NO_EXPORT int
_convert_keepdims(PyObject *obj, int *keepdims)
{
if (PyBool_Check(obj)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is safe if obj == NULL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought surely it was but looking at the actual definition, it clearly is not.

Using the realization from inspecting similar CPython code that as we
intern string names anyway, comparing keys with possible names by
pointer will generally just work.

Also rather drastically rewrote, to use a new parser with simple
conversion functions to replace part of the enormous switch/case
structure, process out separately, and check for duplication after the
parsing.

Reduces the overhead as follows:
```
import numpy as np
a = np.array(1)
b = np.empty_like(a)
%timeit np.positive(a)                                 # 352->348 ns
%timeit np.positive(a, subok=True)                     # 606->501 ns
%timeit np.positive(a, where=True)                     # 586->503 ns
%timeit np.positive(a, where=True, subok=True)         # 695->531 ns
%timeit np.positive(a, b)                              # 354->352 ns
%timeit np.positive(a, out=b)                          # 557->480 ns
%timeit np.positive(a, out=b, subok=True)              # 668->506 ns
%timeit np.positive(a, out=b, where=True, subok=True)  # 752->536 ns
```
@mhvk mhvk force-pushed the ufunc-parsing-speedup-alternative branch from fc5f537 to f93a496 Compare June 29, 2018 13:51
@mhvk
Copy link
Contributor Author

mhvk commented Jun 29, 2018

@eric-wieser - thanks for the (as usual) very good comments. Now having looked more at the normal parsers, it is clear that the cleanup is better not done inside it; being a bit more judicious with borrowed and new references allowed the code to become substantially simpler. I'm now sufficiently convinced this is the right path that I squashed all commits. In a follow-up commit, I'll use this parser for the reduction methods (which may need a bit of tinkering with the keyword-not-found handler, but should otherwise be very straightforward).

Copy link
Member

@eric-wieser eric-wieser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good now - thanks for showing me a few nifty py3 c api features, even if we can't use them yet...

@eric-wieser
Copy link
Member

Let's not merge this until the benchmark suite picks up your new benchmarks

@eric-wieser
Copy link
Member

Alright, let's see how the benchmarks improve

@mhvk mhvk deleted the ufunc-parsing-speedup-alternative branch July 2, 2018 17:10
@mhvk
Copy link
Contributor Author

mhvk commented Jul 2, 2018

Great!

@eric-wieser
Copy link
Member

eric-wieser commented Jul 10, 2018

Doesn't seem to have made any real difference (orange line is this commit)

image

@mhvk
Copy link
Contributor Author

mhvk commented Jul 10, 2018

Very strange, I just checked and I do still get a speed-up, but it is less dramatic than I had in top, mostly because the just-before commit is faster (new numbers, from just before the commit to current master):

import numpy as np
a = np.array(1)
b = np.empty_like(a)
%timeit np.positive(a)                                 # 352->348 ns  # 313->320 ns
%timeit np.positive(a, subok=True)                     # 606->501 ns  # 529->483 ns
%timeit np.positive(a, where=True)                     # 586->503 ns  # 500->473 ns
%timeit np.positive(a, where=True, subok=True)         # 695->531 ns  # 595->515 ns
%timeit np.positive(a, b)                              # 354->352 ns  # 330->330 ns
%timeit np.positive(a, out=b)                          # 557->480 ns  # 504->455 ns
%timeit np.positive(a, out=b, subok=True)              # 668->506 ns  # 573->495 ns
%timeit np.positive(a, out=b, where=True, subok=True)  # 752->536 ns  # 637->526 ns

O well, at least the code is clearer now...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants