
BUG: performance regression of polynomial evaluation #26843

Closed
PeterWurmsdobler opened this issue Jul 3, 2024 · 12 comments

@PeterWurmsdobler

Describe the issue:

The evaluation of a 2nd order polynomial takes about 50% more time with numpy 2.0.0 than with numpy 1.26.4, as revealed by running asv.

Reproduce the code example:

import numpy as np
from numpy.polynomial import Polynomial

class PolynomialBenchmark:
    def setup(self) -> None:
        self.polynomial = Polynomial([10, 2, 6])
        self.p_points = np.linspace(0.0, 1.0, num=1000)

    def time_eval(self) -> None:
        for p in self.p_points:
            self.polynomial(p)

Error message:

| Change | numpy 1.26.4 | numpy 2.0.0 | Ratio | Benchmark (Parameter)         |
|--------|--------------|-------------|-------|-------------------------------|
| +      | 2.67±0.03ms  | 4.09±0.06ms | 1.53  | PolynomialBenchmark.time_eval |

Python and NumPy Versions:

NumPy: 2.0.0
Python: 3.11.8 (main, Apr 29 2024, 16:52:56) [GCC 11.4.0]

Runtime Environment:

Python 3.11.8 (main, Apr 29 2024, 16:52:56) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

import numpy; numpy.show_runtime()
[{'numpy_version': '2.0.0',
'python': '3.11.8 (main, Apr 29 2024, 16:52:56) [GCC 11.4.0]',
'uname': uname_result(system='Linux', node='DE-C-001VH', release='5.15.0-113-generic', version='#123-Ubuntu SMP Mon Jun 10 08:16:17 UTC 2024', machine='x86_64')},
{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2'],
'not_found': ['AVX512F',
'AVX512CD',
'AVX512_KNL',
'AVX512_KNM',
'AVX512_SKX',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL']}},
{'architecture': 'Haswell',
'filepath': '/home/peter/project/.env311/lib/python3.11/site-packages/numpy.libs/libscipy_openblas64_-99b71e71.so',
'internal_api': 'openblas',
'num_threads': 24,
'prefix': 'libscipy_openblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.27'}]

Context for the issue:

The polynomial evaluation is part of a larger project that uses cubic splines as well as scipy numerical integration and root finding, in which the polynomial is evaluated repeatedly. The project's overall execution time increases by about 25%.

@charris
Member

charris commented Jul 3, 2024

Hmm. I suspect the dtype/casting changes here; the code itself hasn't changed. Does it make a difference if you use float coefficients instead of integers?
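
For example, a hypothetical variant of the benchmark setup with float coefficients:

from numpy.polynomial import Polynomial

# Same benchmark polynomial, but with explicitly float coefficients,
# to rule out int -> float casting overhead on every evaluation.
polynomial = Polynomial([10.0, 2.0, 6.0])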

@charris
Member

charris commented Jul 3, 2024

There is also a lot of asarray-type stuff, which may dominate for degree 2 polynomials. Another thing to check might be the time difference for higher degree polynomials.
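
For example, a hypothetical higher-degree variant for comparison:

import numpy as np
from numpy.polynomial import Polynomial

# If the regression is dominated by fixed per-call overhead, the
# relative slowdown should shrink as the degree grows.
rng = np.random.default_rng(0)
polynomial = Polynomial(rng.standard_normal(21))  # degree 20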

@PeterWurmsdobler
Author

Hmm. I suspect the dtype/casting changes here; the code itself hasn't changed. Does it make a difference if you use float coefficients instead of integers?

I tried with both int and float coefficients; it does not make a difference.

@PeterWurmsdobler
Author

There is also a lot of asarray-type stuff, which may dominate for degree 2 polynomials. Another thing to check might be the time difference for higher degree polynomials.

I am not quite sure I understand the first part of the sentence about the asarray issue. As for the degree, only 2nd and 3rd order polynomials are used; no higher order polynomials are needed. Perhaps it would be faster to use an explicit version of the 2nd order polynomial.
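
For reference, an explicit 2nd order evaluation might look like this (a minimal sketch using the benchmark's coefficients, written in Horner form):

def eval_quadratic(x: float) -> float:
    # p(x) = 10 + 2*x + 6*x**2, evaluated in Horner form
    return 10.0 + x * (2.0 + x * 6.0)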

@eendebakpt
Contributor

@charris Actually, the code has changed. Compare

https://github.com/numpy/numpy/blob/maintenance/1.26.x/numpy/polynomial/_polybase.py#L510

with

https://github.com/numpy/numpy/blob/main/numpy/polynomial/_polybase.py#L525

I will look into whether this is the cause of the regression.

@eendebakpt
Contributor

@PeterWurmsdobler First of all: numpy performs well on large arrays, but not so well on small ones. So if you can vectorize your code, e.g. call polynomial(p_points) instead of using a for loop over single data points, that would greatly improve performance.
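
For example, a vectorized variant of the benchmark loop (a minimal sketch using the names from the issue):

import numpy as np
from numpy.polynomial import Polynomial

polynomial = Polynomial([10, 2, 6])
p_points = np.linspace(0.0, 1.0, num=1000)

# One vectorized call over the whole array instead of 1000 scalar
# calls; the per-call Python overhead is paid only once.
values = polynomial(p_points)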

In your minimal example the call self.polynomial(p) takes about 1.5 microseconds on my system (numpy 1.26.x). With numpy 2.0 some overhead is added:

  • The call to pu.mapdomain is almost the same code, but the extra function call adds some overhead (~200 ns, which is typical for python function calls, but noticeable here)
  • Inside mapdomain there is an additional call to np.asarray, which adds ~150 ns
  • After the call to np.asarray the argument p (which was of type np.float64) becomes a numpy array of dtype=float64. The expression off + scl*x is much faster for scalars (used in numpy 1.26) than for arrays (used in numpy 2.0 due to the np.asarray), which again adds a couple of hundred ns; a sketch of the difference is shown below.

I think this explains the performance regression you are seeing.
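
A minimal sketch of the two arithmetic paths (the off and scl values here are arbitrary; compare the two expressions with %timeit):

import numpy as np

off, scl = np.float64(0.0), np.float64(1.0)  # arbitrary mapping parameters
p = np.float64(0.5)        # scalar, as passed in the benchmark loop
x = np.asarray(p)          # 0-d array, as produced inside mapdomain

y_scalar = off + scl * p   # scalar path (numpy 1.26)
y_array = off + scl * x    # 0-d array path (numpy 2.0)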

The changes that have been made are to resolve #17949. We could partly restore performance by replacing x = np.asanyarray(arg) with something like:

if isinstance(arg, np.generic):
    x = arg
else:
    x = np.asanyarray(arg)

(preferably implemented in C; I am not sure whether such a check already exists in numpy). I did not check whether such a change would pass the unit tests.

@charris Would adding special casing such as in the example above be acceptable?

@PeterWurmsdobler
Author

PeterWurmsdobler commented Jul 5, 2024

@PeterWurmsdobler First of all: numpy performs well on large arrays, but not so well on small ones. So if you can vectorize your code, e.g. call polynomial(p_points) instead of using a for loop over single data points, that would greatly improve performance.

Thanks for the suggestion; unfortunately, the polynomial evaluation is part of an optimisation loop whose arguments are not known a priori.

Also, thanks for your detailed analysis of the internal execution performance. As it happens, I tried a simple hand-written version of the 2nd order polynomial using additions and multiplications; its performance was on par with the Polynomial class.

@eendebakpt
Contributor

Adding the fast path for np.generic does improve performance; it might recover nearly all of the performance of numpy 1.26. I added the code and a benchmark to #26550.

Benchmark results on main:

· Discovering benchmarks
· Running 5 total benchmarks (1 commits * 1 environments * 5 benchmarks)
[  0.00%] ·· Benchmarking existing-pyC__projects_misc_numpy_.venv_Scripts_python.exe
[ 10.00%] ··· Running (bench_polynomial.Polynomial.time_polynomial_addition--).....
[ 60.00%] ··· bench_polynomial.Polynomial.time_polynomial_addition                  14.3±0.3μs
[ 70.00%] ··· bench_polynomial.Polynomial.time_polynomial_evaluation_array_1000     15.0±0.4μs
[ 80.00%] ··· bench_polynomial.Polynomial.time_polynomial_evaluation_array_3        5.01±0.2μs
[ 90.00%] ··· bench_polynomial.Polynomial.time_polynomial_evaluation_python_float   2.16±0.02μs
[100.00%] ··· bench_polynomial.Polynomial.time_polynomial_evaluation_scalar         2.20±0.01μs

On the PR:

· Discovering benchmarks
· Running 5 total benchmarks (1 commits * 1 environments * 5 benchmarks)
[  0.00%] ·· Benchmarking existing-pyC__projects_misc_numpy_.venv_Scripts_python.exe
[ 10.00%] ··· Running (bench_polynomial.Polynomial.time_polynomial_addition--).....
[ 60.00%] ··· bench_polynomial.Polynomial.time_polynomial_addition                  13.6±0.05μs
[ 70.00%] ··· bench_polynomial.Polynomial.time_polynomial_evaluation_array_1000     15.1±1μs
[ 80.00%] ··· bench_polynomial.Polynomial.time_polynomial_evaluation_array_3        5.21±0.03μs
[ 90.00%] ··· bench_polynomial.Polynomial.time_polynomial_evaluation_python_float   1.45±0.01μs
[100.00%] ··· bench_polynomial.Polynomial.time_polynomial_evaluation_scalar         1.51±0.04μs

@eendebakpt
Contributor

With #26550 merged I think most of the performance regression has been recovered.

@PeterWurmsdobler One way to improve performance even more might be to use polyval directly. This works if the polynomial does not change during the optimization:

import numpy as np
from numpy.polynomial import Polynomial
from numpy.polynomial.polynomial import polyval

polynomial = Polynomial([10, 2, 6])
p_points = np.linspace(0.0, 1.0, num=1000)

def g():
    # Evaluate via the Polynomial class, one scalar at a time.
    for p in p_points:
        polynomial(p)

# Map the polynomial onto the default domain/window so that its
# stored coefficients can be passed to polyval directly.
polynomial = polynomial.convert()

def h():
    # Evaluate via polyval, skipping the domain-mapping overhead
    # of Polynomial.__call__.
    for p in p_points:
        polyval(p, polynomial.coef)

%timeit g()
%timeit h()
# 1.66 ms ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# 824 µs ± 7.93 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@charris
Member

charris commented Jul 6, 2024

If you actually use linspace, there is a linspace method in the Polynomial class that returns both the x and y values.
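
For example, a sketch based on the earlier benchmark setup (n and domain chosen to match):

from numpy.polynomial import Polynomial

polynomial = Polynomial([10, 2, 6])

# linspace returns the sample points and the evaluated values in a
# single vectorized call.
x, y = polynomial.linspace(n=1000, domain=[0.0, 1.0])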

@kveretennicov

Thank you very much, @eendebakpt and @charris! When do you think the next numpy release will be cut?

@seberg
Member

seberg commented Aug 15, 2024

Going to close this; there doesn't seem to be a very concrete thing to improve here.

When do you think the next numpy release will be cut?

Not sure exactly, but there will already be an rc in a few weeks.

seberg closed this as completed on Aug 15, 2024