
ENH: help compilers to auto-vectorize reduction operators #21001


Merged: 2 commits into numpy:main from faster-sum, Mar 7, 2022

Conversation

@zephyr111 (Contributor) commented Feb 6, 2022

Hello everyone,

This PR is meant to significantly improve the performance of reduction functions (such as numpy.sum): from 2.6 to 20 times faster. It also improves the performance of the casting function used by many NumPy functions. These changes are in response to this detailed Stack Overflow analysis showing that the current numpy.sum code does not benefit from SIMD instructions (with both integer and floating-point numbers).

The PR changes the macros so that the compiler can generate faster code for reductions when the array is contiguous, in a way similar to the existing code for basic operators. It also generates an AVX2 version (if supported) of the conversion function that is called when the dtype of the reduction result does not match the dtype of the array items.
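To illustrate the idea (a minimal sketch only; the function name and layout are made up, this is not the actual NumPy macro code): giving the compiler a dedicated contiguous path, written as a plain indexed loop over a local accumulator, is what lets it auto-vectorize the reduction, while the generic strided loop remains as a fallback.

```c
#include <stddef.h>
#include <stdint.h>

int64_t sum_int64(const char *data, ptrdiff_t stride, size_t n)
{
    int64_t acc = 0;
    if (stride == (ptrdiff_t)sizeof(int64_t)) {
        /* Contiguous fast path: a simple indexed loop that GCC/Clang
         * can auto-vectorize (e.g. with SSE or AVX2). */
        const int64_t *in = (const int64_t *)data;
        for (size_t i = 0; i < n; i++) {
            acc += in[i];
        }
    }
    else {
        /* Generic strided fallback for non-contiguous inputs. */
        for (size_t i = 0; i < n; i++) {
            acc += *(const int64_t *)(data + i * stride);
        }
    }
    return acc;
}
```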

Please find below the performance results of numpy.sum with a 1 MiB array on my Linux machine with an Intel i5-9600KF processor. The tests were run for all available integer types, with and without a custom dtype argument set to the array's dtype (the default is int64 or uint64 on my machine).

Old:
    Default dtype argument:
        int8:    687 µs ± 4.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint8:   661 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int16:   346 µs ± 1.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint16:  368 µs ± 3.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int32:   152 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint32:  196 µs ± 638 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        int64:   52.4 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint64:  52.9 µs ± 521 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)

    Custom dtype argument:
        int8:    404 µs ± 3.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint8:   408 µs ± 751 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int16:   203 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint16:  203 µs ± 1.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int32:   102 µs ± 551 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint32:  102 µs ± 438 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        int64:   52.7 µs ± 338 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint64:  52.9 µs ± 382 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)

New:
    Default dtype argument:
        int8:    263 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint8:   248 µs ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int16:   120 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint16:  116 µs ± 394 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int32:   58.7 µs ± 425 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint32:  58.6 µs ± 759 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        int64:   17.1 µs ± 130 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint64:  16.9 µs ± 291 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)

    Custom dtype argument:
        int8:    21.7 µs ± 246 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint8:   20.8 µs ± 233 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int16:   18.1 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        uint16:  18.1 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        int32:   17.6 µs ± 83.9 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint32:  17.9 µs ± 109 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        int64:   16.9 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)
        uint64:  17.4 µs ± 246 ns per loop (mean ± std. dev. of 7 runs, 4000 loops each)

Speedup (i.e. old/new):
    Default dtype argument:
        int8:     2.61
        uint8:    2.67
        int16:    2.88
        uint16:   3.17
        int32:    2.59
        uint32:   3.34
        int64:    3.06
        uint64:   3.13

    Custom dtype argument:
        int8:    18.61
        uint8:   19.61
        int16:   11.21
        uint16:  11.21
        int32:    5.80
        uint32:   5.70
        int64:    3.12
        uint64:   3.04

As you can see, the execution time improves significantly (by up to ~20x). This is especially the case when a dtype argument is provided, since a (slow, sub-optimal) conversion function is applied otherwise (as described in the detailed analysis linked above).

Similar improvements apply to functions like np.prod, np.add.reduce, np.subtract.reduce, np.logical_and.reduce, etc.
It turns out that improving the casting function also results in substantially faster execution of many functions like np.cumsum, np.cumprod, np.all and np.any. Even functions like np.average and np.mean, or basic multiplications/additions/subtractions by a constant of a different type (which require the array to be cast), are a bit faster with an AVX2 casting function.
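For context on the casting part (again only a sketch with a hypothetical name, not NumPy's actual cast function): the conversion in question is an element-wise widening loop, and once it is written as a simple contiguous loop the compiler can emit SIMD (e.g. AVX2) code for it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: widen int8 items to int64 before they are reduced.
 * A straightforward contiguous loop like this is easy for GCC/Clang to
 * auto-vectorize. */
void cast_int8_to_int64(const int8_t *in, int64_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = (int64_t)in[i];
    }
}
```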

@seberg (Member) left a comment

Cool, and probably long overdue especially for the casting! @seiko2plus may have comments about using universal intrinsics, but it may be best to move forward here first and then add that later (at least for the casting part).

If you are up for it, I think it would make sense to split this into two PRs, one for casting and one for the integer reductions; that should make it a bit easier to discuss each change.
(It isn't a big PR though.)

I did not have a super close look yet, but I made some comments about casting and alignment. (Basically, I wonder if this is worthwhile for the unaligned case; and possibly incorrect for that one dead branch).

One thing you can already do if you like: add a performance release note, see doc/release/upcoming_changes/README.rst

@seberg (Member) commented Feb 6, 2022

Hmm, I had not noticed the test failure. Might be another reason to split up the two, since I am not sure whether it comes from the reductions or the cast change. It looks like it should be due to the cast change, but it is not clear to me why.

(Note that the "benchmark" one is just spurious)

EDIT: Ah, the failing test seems to be a float64 -> int16 cast for values that are beyond the int16 range. I suspect this may be undefined behaviour in C99. Not immediately sure how much we should worry, but it appears to be a change: right now, we only get undefined behaviour for NaN and Inf, I think.

@zephyr111 (Contributor, Author) commented Feb 6, 2022

It looks like it should be due to the cast change, but it is not clear to me why.

I saw that too, but I found it very weird since I did not really change the actual computation, just some compiler "hints". Your point about alignment makes me wonder whether GCC could assume that the input is aligned for AVX2 and simply generate wrong code for this benchmark. That being said, it would be very surprising, since AFAIK GCC should not make such an assumption unless explicitly told to. Otherwise, I guess it could mean the current code has undefined behaviour and using AVX2 makes it visible due to different optimizations (which is scary).

I suspect this may be undefined behaviour in C99.

That makes sense! It is indeed undefined behaviour to cast a floating-point value to an integer type that cannot represent its integer part. See this Stack Overflow post.
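For reference, here is a minimal standalone C example of the problematic conversion (not NumPy code): converting a double whose truncated value cannot be represented by the target integer type is undefined behaviour per C99 6.3.1.4, so the result is free to differ between the scalar and SIMD code paths.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    double d = 1.0e6;          /* integer part does not fit in int16_t */
    int16_t i = (int16_t)d;    /* undefined behaviour (C99 6.3.1.4p1)  */
    printf("%d\n", (int)i);    /* may differ between compilers and code paths */
    return 0;
}
```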

If you are up for it, I think it would make sense to split this into two PRs.

I am fine with this, but how should I proceed? Should I just revert that part of the change on this PR and open a new PR referencing this one, so that this one can be merged first?

@seberg (Member) commented Feb 6, 2022

I am fine with this, but how should I proceed? Should I just revert that part of the change on this PR and open a new PR referencing this one, so that this one can be merged first?

Sounds good, i.e. remove the casting part of this PR and create a new PR for it (or vice versa, I guess). About the undefined behaviour, I have to think. I am not surprised it behaves differently, since it is UB, but it was effectively always well defined before, so we have to decide either that this is a weird corner case that should not matter, or see if there is a trick to preserve the behaviour.
(We could be pragmatic and split out vectorizing those casts, because unsafe casts are less common anyway.)

@zephyr111 force-pushed the faster-sum branch 2 times, most recently from 8b6bea3 to c785851 on February 26, 2022
@zephyr111 (Contributor, Author) commented

I added a release note and split the PR into two parts. This one still focuses on the faster reduction operators; the other part, about casting, is #21123.

@seberg (Member) left a comment

A bit too burnt out to merge right now, but looks good to me. Unless @seiko2plus has any concerns?

It could be nice to add additional benchmarks, but we do have a benchmark for the addition path, so that seems OK since it is all in one chunk.

@@ -0,0 +1,3 @@
Faster reduction operators
--------------------------
Reduction operations like `numpy.sum`, `numpy.prod`, `numpy.add.reduce`, `numpy.logical_and.reduce` on contiguous integer-based arrays are now much faster.
Member: Would be nice to break the long line, but doesn't really matter.

@zephyr111 (Contributor, Author): Fixed.

Co-authored-by: Matti Picus <matti.picus@gmail.com>
@seberg (Member) commented Mar 7, 2022

Let's put this in, it looks like a nice speed improvement! If the SIMD crowd changes this, having the contiguous special case for comparison is probably good as well. Thanks @zephyr111.

(One day, I hope we can move these into the ArrayMethod's get_loop function, but that is a different issue.)

@seberg seberg merged commit 330057f into numpy:main Mar 7, 2022
@InessaPawson (Member) commented

Hi-five on merging your first pull request to NumPy, @zephyr111! We hope you stick around! Your choices aren’t limited to programming – you can review pull requests, help us stay on top of new and old issues, develop educational material, work on our website, add or improve graphic design, create marketing materials, translate website content, write grant proposals, and help with other fundraising initiatives. For more info, check out: https://numpy.org/contribute
Also, consider joining our mailing list. This is a great way to connect with other cool people in our community and be part of important conversations that affect the development of NumPy: https://mail.python.org/mailman/listinfo/numpy-discussion
