
ENH: help compilers to auto-vectorize reduction operators #21001


Merged: 2 commits into numpy:main from faster-sum, Mar 7, 2022

Conversation

@zephyr111 (Contributor) commented Feb 6, 2022

Hello everyone,

This PR is meant to significantly improve the performance of reduction functions (such as numpy.sum): from 2.6 to 20 times faster. It also improves the performance of the casting function used by many NumPy functions. These changes are in response to this detailed Stack Overflow analysis showing that the current numpy.sum code does not benefit from SIMD instructions (with both integer and floating-point numbers).

The PR changes the macros so that the compiler can generate faster code for reductions when the array is contiguous, in a way similar to the existing code for basic operators. It also generates an AVX2 version (if supported) of the conversion function that is called when the dtype of the reduction result does not match the dtype of the array items.
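To illustrate the idea (a minimal sketch only; the function name and layout are made up, this is not the actual NumPy macro code): giving the compiler a dedicated contiguous path, written as a plain indexed loop over a local accumulator, is what lets it auto-vectorize the reduction, while the generic strided loop remains as a fallback.

```c
#include <stddef.h>
#include <stdint.h>

int64_t sum_int64(const char *data, ptrdiff_t stride, size_t n)
{
    int64_t acc = 0;
    if (stride == (ptrdiff_t)sizeof(int64_t)) {
        /* Contiguous fast path: a simple indexed loop that GCC/Clang
         * can auto-vectorize (e.g. with SSE or AVX2). */
        const int64_t *in = (const int64_t *)data;
        for (size_t i = 0; i < n; i++) {
            acc += in[i];
        }
    }
    else {
        /* Generic strided fallback for non-contiguous inputs. */
        for (size_t i = 0; i < n; i++) {
            acc += *(const int64_t *)(data + i * stride);
        }
    }
    return acc;
}
```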

Please find below the performance results of numpy.sum with a 1 MiB array on my Linux machine with an Intel i5-9600KF processor. The tests were run for all available integer types, with and without a custom dtype argument set to the array's dtype (the default is int64 or uint64 on my machine).

Old:
    Default dtype argument:
        int8:    687 µs ± 4.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint8:   661 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int16:   346 µs ± 1.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint16:  368 µs ± 3.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int32:   152 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint32:  196 µs ± 638 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        int64:   52.4 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint64:  52.9 µs ± 521 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)

    Custom dtype argument:
        int8:    404 µs ± 3.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint8:   408 µs ± 751 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int16:   203 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint16:  203 µs ± 1.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int32:   102 µs ± 551 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint32:  102 µs ± 438 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        int64:   52.7 µs ± 338 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint64:  52.9 µs ± 382 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)

New:
    Default dtype argument:
        int8:    263 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint8:   248 µs ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int16:   120 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint16:  116 µs ± 394 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int32:   58.7 µs ± 425 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint32:  58.6 µs ± 759 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        int64:   17.1 µs ± 130 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint64:  16.9 µs ± 291 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)

    Custom dtype argument:
        int8:    21.7 µs ± 246 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint8:   20.8 µs ± 233 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int16:   18.1 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint16:  18.1 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int32:   17.6 µs ± 83.9 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint32:  17.9 µs ± 109 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        int64:   16.9 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint64:  17.4 µs ± 246 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)

Speedup (i.e. old/new):
    Default dtype argument:
        int8:     2.61
        uint8:    2.67
        int16:    2.88
        uint16:   3.17
        int32:    2.59
        uint32:   3.34
        int64:    3.06
        uint64:   3.13

    Custom dtype argument:
        int8:    18.61
        uint8:   19.61
        int16:   11.21
        uint16:  11.21
        int32:    5.80
        uint32:   5.70
        int64:    3.12
        uint64:   3.04

As you can see, the execution time improves significantly (by up to ~20x). This is especially the case when a dtype argument is provided, since a (slow, sub-optimal) conversion function is applied otherwise (as described in the detailed analysis linked above).

Similar improvements apply to functions like np.prod, np.add.reduce, np.subtract.reduce, np.logical_and.reduce, etc.
It turns out that improving the casting function also results in substantially faster execution of many functions like np.cumsum, np.cumprod, np.all and np.any. Even functions like np.average and np.mean, or basic multiplications/additions/subtractions by a constant of a different type (which require the array to be cast), are a bit faster with an AVX2 casting function.
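For context on the casting part (again only a sketch with a hypothetical name, not NumPy's actual cast function): the conversion in question is an element-wise widening loop, and once it is written as a simple contiguous loop the compiler can emit SIMD (e.g. AVX2) code for it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: widen int8 items to int64 before they are reduced.
 * A straightforward contiguous loop like this is easy for GCC/Clang to
 * auto-vectorize. */
void cast_int8_to_int64(const int8_t *in, int64_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = (int64_t)in[i];
    }
}
```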

@seberg (Member) left a comment

Cool, and probably long overdue especially for the casting! @seiko2plus may have comments about using universal intrinsics, but it may be best to move forward here first and then add that later (at least for the casting part).

If you are up for it, I think it would make sense to split this into two PRs, one for casting and one for the integer reductions; that should make it a bit easier to discuss each change.
(It isn't a big PR though.)

I did not have a super close look yet, but I made some comments about casting and alignment. (Basically, I wonder if this is worthwhile for the unaligned case; and possibly incorrect for that one dead branch).

One thing you can already do if you like: add a performance release note, see doc/release/upcoming_changes/README.rst

@seberg (Member) commented Feb 6, 2022

Hmm, I had not noticed the test failure. Might be another reason to split up the two, since I am not sure whether it comes from the reductions or the cast change. It looks like it should be due to the cast change, but it is not clear to me why.

(Note that the "benchmark" one is just spurious)

EDIT: Ah, the failing test seems to be a float64 -> int16 cast for values that are beyond the int16 range. I suspect this may be undefined behaviour in C99. Not immediately sure how much we should worry, but it appears to be a change: right now, we only get undefined behaviour for NaN and Inf, I think.

@zephyr111 (Contributor, Author) commented Feb 6, 2022

It looks like it should be due to the cast change, but it is not clear to me why.

I saw that too, but I found it very weird since I did not really change the actual computation, just some compiler "hints". Your point about alignment makes me wonder whether GCC could assume that the input is aligned for AVX2 and simply generate wrong code for this benchmark. That being said, it would be very surprising, since AFAIK GCC should not make such an assumption unless explicitly told to. Otherwise, I guess it could mean the current code has undefined behaviour and using AVX2 makes it visible due to different optimizations (which is scary).

I suspect this may be undefined behaviour in C99.

That makes sense! It is indeed undefined behaviour to cast a floating-point value to an integer type that cannot represent its integer part. See this Stack Overflow post.
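For reference, here is a minimal standalone C example of the problematic conversion (not NumPy code): converting a double whose truncated value cannot be represented by the target integer type is undefined behaviour per C99 6.3.1.4, so the result is free to differ between the scalar and SIMD code paths.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    double d = 1.0e6;          /* integer part does not fit in int16_t */
    int16_t i = (int16_t)d;    /* undefined behaviour (C99 6.3.1.4p1)  */
    printf("%d\n", (int)i);    /* may differ between compilers and code paths */
    return 0;
}
```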

If you are up for it, I think it would make sense to split this into two PRs.

I am fine with this, but how should I proceed? Should I just revert that part of the change on this PR and open a new PR referencing this one, so that this one can be merged first?

@seberg (Member) commented Feb 6, 2022

I am fine with this, but how should I proceed? Should I just revert that part of the change on this PR and open a new PR referencing this one, so that this one can be merged first?

Sounds good, i.e. remove the casting part of this PR and create a new PR for it (or vice versa, I guess). About the undefined behaviour, I have to think. I am not surprised it behaves differently, since it is UB, but it was effectively always well defined before, so we have to decide either that this is a weird corner case that should not matter, or see if there is a trick to preserve the behaviour.
(We could be pragmatic and split out vectorizing those casts, because unsafe casts are less common anyway.)

@zephyr111 force-pushed the faster-sum branch 2 times, most recently from 8b6bea3 to c785851 on February 26, 2022
@zephyr111 (Contributor, Author) commented

I added a release note and split the PR into two parts. This one still focuses on the faster reduction operators; the other part, about casting, is #21123.

@seberg (Member) left a comment

A bit too burnt out to merge right now, but looks good to me. Unless @seiko2plus has any concerns?

It could be nice to add additional benchmarks, but we do have a benchmark for the addition path, so that seems OK since it is all in one chunk.

@@ -0,0 +1,3 @@
Faster reduction operators
--------------------------
Reduction operations like `numpy.sum`, `numpy.prod`, `numpy.add.reduce`, `numpy.logical_and.reduce` on contiguous integer-based arrays are now much faster.
Member: Would be nice to break the long line, but doesn't really matter.

@zephyr111 (Contributor, Author): Fixed.

Co-authored-by: Matti Picus <matti.picus@gmail.com>
@seberg (Member) commented Mar 7, 2022

Let's put this in, it looks like a nice speed improvement! If the SIMD crowd changes this, having the contiguous special case for comparison is probably good as well. Thanks @zephyr111.

(One day, I hope we can move these into the ArrayMethod's get_loop function, but that is a different issue.)

@seberg seberg merged commit 330057f into numpy:main Mar 7, 2022
@InessaPawson (Member) commented

Hi-five on merging your first pull request to NumPy, @zephyr111! We hope you stick around! Your choices aren’t limited to programming – you can review pull requests, help us stay on top of new and old issues, develop educational material, work on our website, add or improve graphic design, create marketing materials, translate website content, write grant proposals, and help with other fundraising initiatives. For more info, check out: https://numpy.org/contribute
Also, consider joining our mailing list. This is a great way to connect with other cool people in our community and be part of important conversations that affect the development of NumPy: https://mail.python.org/mailman/listinfo/numpy-discussion
