Polynomial.py Optimization #4610


Closed
odysseus9672 opened this issue Apr 10, 2014 · 12 comments · Fixed by #5031
@odysseus9672

polyval in Polynomial.py is about 5 times slower than an implementation based on reduce. I would suggest replacing the following lines (672 and 673):

    for i in range(len(p)):
        y = x * y + p[i]
    return y

with:

    return reduce(lambda y, c: y * x + c, p)
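
For Python 3, reduce has to be imported from functools. A minimal runnable sketch of the proposal (polyval_reduce is a hypothetical name used here for illustration):

    from functools import reduce  # reduce is not a builtin on Python 3

    import numpy as np

    def polyval_reduce(p, x):
        # Horner's method via reduce: start from the highest-order
        # coefficient and fold in one coefficient per step.
        return reduce(lambda y, c: y * x + c, np.asarray(p))

    # Evaluate 3*x**2 + 2*x + 1 at x = 0.5 (coefficients ordered
    # highest degree first, matching np.polyval):
    print(polyval_reduce([3.0, 2.0, 1.0], 0.5))  # 2.75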

@odysseus9672
Author

Here's some code to demonstrate the speed advantage (evaluates a polynomial version of exp at pi):

import timeit

NumCalls = 1 << 20  # 2**20, about one million calls
setupStr = ("from functools import reduce\n"  # needed on Python 3
            "import numpy as np\n"
            "import scipy.special as spsf\n"
            "c = 1.0 / spsf.gamma(np.arange(1, 65))\n"
            "x = 3.141592653589793")
timeit.timeit("reduce(lambda y, a: y*x + a, c)", setup=setupStr, number=NumCalls) / float(NumCalls)
timeit.timeit("np.polyval(c, x)", setup=setupStr, number=NumCalls) / float(NumCalls)

@filmor

filmor commented Aug 27, 2014

I second this. Is there any particular reason to use a bare for-loop?

@juliantaylor
Contributor

There are some complications with subclasses, as far as I remember, and the test coverage of this function is very poor. Just using reduce like that will break something; unfortunately I forget what exactly...
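
One guess at the kind of complication (illustrative only, not necessarily the case meant here): ndarray subclasses can redefine the operators polyval relies on. np.matrix makes * a non-commutative matrix product, so a rewrite that reorders the Horner step (the reduce version computes y * x where the loop computes x * y) can silently change results:

    import numpy as np

    # np.matrix redefines * as matrix multiplication, which is not
    # commutative, so x * y and y * x give different answers.
    x = np.matrix([[1.0, 2.0], [3.0, 4.0]])
    y = np.matrix([[0.0, 1.0], [1.0, 0.0]])
    print(x * y)  # [[2. 1.] [4. 3.]]
    print(y * x)  # [[3. 4.] [1. 2.]]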

@charris
Member

charris commented Aug 28, 2014

Note that numpy.polynomial.polynomial.polyval also uses a loop, but seems faster:

In [5]: timeit polyval1(x,c[::-1])
100000 loops, best of 3: 13.3 us per loop

In [6]: timeit polyval(c,x)
10000 loops, best of 3: 74.3 us per loop

In [7]: timeit reduce(lambda y, a: y*x + a, c)
100000 loops, best of 3: 13.7 us per loop

Here the first is the numpy.polynomial.polynomial version, the second is numpy.polyval, and the third is reduce. Not sure why numpy.polyval is so slow, but it may not be the loop.
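
For reference, a sketch of a setup that reproduces this comparison (the import aliases and the coefficient array are my assumptions, reusing the demo above; note the two polyval functions expect opposite coefficient orders, hence the c[::-1]):

    import numpy as np
    import scipy.special as spsf
    from functools import reduce  # needed on Python 3

    from numpy import polyval                                    # coefficients highest degree first
    from numpy.polynomial.polynomial import polyval as polyval1  # coefficients lowest degree first

    c = 1.0 / spsf.gamma(np.arange(1, 65))
    x = 3.141592653589793

    # All three expressions compute the same value:
    assert np.allclose(polyval1(x, c[::-1]), polyval(c, x))
    assert np.allclose(reduce(lambda y, a: y * x + a, c), polyval(c, x))
    # %timeit each of the three calls in IPython to reproduce the
    # numbers above.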

@filmor

filmor commented Aug 29, 2014

After some more testing (i.e. increasing the coefficient count to absurd numbers) I'd also say that it's indeed not the loop but instead the initialisation overhead that leads to the differences. Still, having two subtly different functions in the same library is a problem IMHO.

@charris
Member

charris commented Sep 2, 2014

On Fri, Aug 29, 2014 at 3:58 AM, Benedikt Sauer notifications@github.com wrote:

> After some more testing (i.e. increasing the coefficient count to absurd
> numbers) I'd also say that it's indeed not the loop but instead the
> initialisation overhead that leads to the differences. Still, having two
> subtly different functions in the same library is a problem IMHO.

Turns out the difference in speed is y * x vs x * y. That's rather
subtle, and I suspect the reason is __mul__ vs __rmul__, but that is
just a guess.

Chuck

@charris
Member

charris commented Sep 2, 2014

Note that in this case y is a numpy scalar after the first iteration and x
is an array. I think there is something not quite ideal in the way we are
handling that situation.

Chuck
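
To make the type evolution concrete, a minimal sketch (this assumes the NumPy 1.x-era behavior that an operation whose inputs are all 0-d returns a numpy scalar):

    import numpy as np

    x = np.asanyarray(3.14)  # scalar input becomes a 0-d array inside polyval
    y = np.zeros_like(x)     # the accumulator starts as a 0-d array
    c = np.array([1.0, 2.0, 3.0])

    print(type(y).__name__)  # ndarray
    y = x * y + c[0]         # operating on 0-d inputs yields a numpy scalar
    print(type(y).__name__)  # float64
    # From here on, x * y is array.__mul__(scalar) -- the slow path
    # measured in the timings below.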

@charris
Member

charris commented Sep 2, 2014

It's a factor of 5 difference in speed, so we should figure out how to
make array * scalar faster.

Chuck

@filmor

filmor commented Sep 2, 2014

I can't reproduce this. What code did you use to show that array.__mul__ vs. scalar.__rmul__ (or vice-versa) is the problem here?

@charris
Member

charris commented Sep 2, 2014

I reversed the order and timed it. It's not __rmul__ vs __mul__, but just scalar.__mul__ vs array.__mul__.

In [8]: a = np.float64(123)

In [9]: b = np.array(4.)

In [10]: timeit a*b
10000000 loops, best of 3: 110 ns per loop

In [11]: timeit b*a
1000000 loops, best of 3: 985 ns per loop

@charris
Member

charris commented Sep 2, 2014

For completeness

In [12]: timeit a*a
10000000 loops, best of 3: 66.6 ns per loop

In [13]: timeit b*b
1000000 loops, best of 3: 349 ns per loop

In [14]: timeit a + b
10000000 loops, best of 3: 109 ns per loop

In [15]: timeit b + a
1000000 loops, best of 3: 1.01 µs per loop

@charris
Member

charris commented Sep 2, 2014

Note that Python scalars do go the __rmul__ route.

In [5]: a = 123.

In [6]: timeit a*b
1000000 loops, best of 3: 515 ns per loop

In [7]: timeit b*a
1000000 loops, best of 3: 512 ns per loop

charris added a commit to charris/numpy that referenced this issue Sep 2, 2014
Multiplying a numpy_scalar times a numpy_array is much faster than
the other way around. This PR switches the order of multiplication
in the polyval function resulting in a speedup of about 5x for scalar
values of x.

Closes numpy#4610.
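
For reference, a sketch of what the change presumably amounts to (a paraphrase of the polyval body, not the verbatim patch; polyval_fixed is an illustrative name):

    import numpy as np

    def polyval_fixed(p, x):
        p = np.asarray(p)
        x = np.asanyarray(x)
        y = np.zeros_like(x)
        for coef in p:
            # was: y = x * y + coef  -- array.__mul__(scalar), the slow path
            y = y * x + coef         # scalar.__mul__(array), the fast path
        return y

    print(polyval_fixed([3.0, 2.0, 1.0], 0.5))  # 2.75, matches np.polyval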