ENH: Enable AVX2/AVX512 support to numpy #10251


Merged
merged 1 commit into from
Dec 26, 2017
Conversation

VictorRodriguez
Contributor

This patch enables AVX2/AVX-512F instructions and distutils flags to
maximise the use of IA technology such as Haswell and Skylake platforms
on math functions of numpy

Compiled with :
python3 setup.py build -b py3 --fcompiler=gnu95

Examples of the results generated are:

mtrand.cpython-36m-x86_64-linux-gnu.so
mtrand.cpython-36m-x86_64-linux-gnu.so.avx2
mtrand.cpython-36m-x86_64-linux-gnu.so.avx512

These make proper use of ZMM / YMM registers and FMA instructions.

Signed-off-by: Arjan van de Ven arjan@linux.intel.com
Signed-off-by: William Douglas william.douglas@intel.com
Signed-off-by: Victor Rodriguez victor.rodriguez.bahena@intel.com

@njsmith
Member

njsmith commented Dec 20, 2017 via email

@ghost

ghost commented Dec 20, 2017

It would be better to check AVX at runtime (checking it on import would be negligible) and then dynamically adjust the code path.
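A minimal sketch of such a runtime check, assuming a GCC/Clang toolchain (`__builtin_cpu_supports` reads CPUID; the function name here is illustrative, not numpy's):

```c
#include <stddef.h>

/* Hypothetical dispatch sketch: query the CPU once and pick a code
 * path.  __builtin_cpu_supports caches the CPUID result, so doing
 * this at module import time costs effectively nothing. */
static const char *pick_simd_path(void)
{
    if (__builtin_cpu_supports("avx512f"))
        return "avx512";
    if (__builtin_cpu_supports("avx2"))
        return "avx2";
    return "sse2";
}
```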

@fenrus75

There are two separate things in this PR (which is a bit confusing):

  1. Create multiple .so's; this part of the PR is incomplete, so let's pretend it's not there for now (hoping Victor can update to just the code without that change).

  2. Change simd.inc.src to use AVX2 or AVX512 at compile time. This closes the gap that if you compile numpy for AVX2 (or AVX-512) today, say with -march=native, you get the SSE code for the simd functions even though the rest of the code gets AVX2. Yes, ideally this would be a runtime thing, but that is a more complex change, and could easily be a follow-on to this initial "make the code be there" change.
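Point 2 comes down to compile-time selection: GCC defines `__AVX2__` when building with `-mavx2` (or `-march=native` on a capable machine). A hedged sketch of that pattern (the macro and function names are illustrative, not the actual simd.inc.src code):

```c
#include <stddef.h>

/* Illustrative compile-time width selection: when numpy is built with
 * -mavx2, __AVX2__ is defined and the wider loop body is compiled in;
 * otherwise the SSE-width body remains. */
#ifdef __AVX2__
#define VECTOR_WIDTH 8   /* 8 floats per 256-bit YMM register */
#else
#define VECTOR_WIDTH 4   /* 4 floats per 128-bit XMM register */
#endif

void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    /* main loop steps in VECTOR_WIDTH chunks; a real implementation
     * would use intrinsics inside this loop */
    for (; i + VECTOR_WIDTH <= n; i += VECTOR_WIDTH)
        for (size_t j = 0; j < VECTOR_WIDTH; j++)
            dst[i + j] = a[i + j] + b[i + j];
    for (; i < n; i++)          /* scalar tail */
        dst[i] = a[i] + b[i];
}
```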

In terms of performance, for very basic vectorized math, AVX2 is a theoretical 2x increase over SSE, and AVX512 is another theoretical 2x. For code where multiply operations can be fused with add operations, AVX2 has an additional theoretical 2x to gain over SSE.

Theoretical gains are not realized gains, but for basic functions like those in simd.inc.src, it's pretty typical to get 90% or more of theoretical (depending a bit on how big the arrays are).

@ghost

ghost commented Dec 21, 2017

If this PR is restricted to (2), then IMO it's mergeable. It would still be a nice enhancement to add runtime checks, though, because the wheel package format doesn't allow specifying a processor architecture, meaning that most people probably won't run this code.

@ghost

ghost commented Dec 21, 2017

@VictorRodriguez It looks like you got hit with MSVC 2008. Just wrap your code like this:

```c
#if !defined(_MSC_VER) || _MSC_VER >= 1600
/* AVX code here */
#endif
```

@VictorRodriguez
Contributor Author

@xoviat thanks a lot for the help; after multiple experiments, it is passing all the tests.
OK to merge?

@ghost

ghost left a comment

LGTM.

@charris
Member

charris commented Dec 23, 2017

@VictorRodriguez Could you add an entry to the 1.15 release notes under "Improvements"?

@charris charris added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Dec 23, 2017
@VictorRodriguez
Contributor Author

VictorRodriguez commented Dec 26, 2017

@charris I updated the PR with a description under doc/release/1.15.0-notes.rst, is that OK?

@charris charris merged commit 0385247 into numpy:master Dec 26, 2017
@charris
Member

charris commented Dec 26, 2017

Thanks @VictorRodriguez .

@juliantaylor
Contributor

juliantaylor commented Mar 31, 2018

On what type of CPU do you see 90% of the theoretical gains?
I have never seen more than 5% for this type of code on a pretty large range of Intel CPUs (Xeon, and i5/i7), so I never bothered.
The code can also be simplified to ditch the unaligned paths; AVX doesn't really care about alignment since Haswell.

@juliantaylor
Contributor

Also, this is broken, as the overlap checks have not been adapted for the larger vector sizes.
The better approach for AVX on these simple things is probably to let the compiler do it, as we do for the integers.
It was done manually for SSE2 because compilers were a lot worse back then.
Compilers that support AVX can vectorize this stuff themselves.
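The comparison here is with loops the compiler vectorizes on its own. A plain loop like this sketch, compiled with e.g. `gcc -O3 -mavx2`, is turned into YMM code without any intrinsics (the flags and inspection options are the usual GCC ones, not something this PR adds):

```c
#include <stddef.h>

/* A loop simple enough for GCC/Clang auto-vectorization: built with
 * -O3 -mavx2 the compiler emits 256-bit vmulpd on its own, which can
 * be confirmed with -fopt-info-vec or objdump -d. */
void multiply_arrays(double *dst, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * b[i];
}
```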

@charris
Member

charris commented Apr 13, 2018

@juliantaylor Is the overlap problem fixed by your recent PR?

@VictorRodriguez
Contributor Author

@juliantaylor thanks a lot for your feedback. Letting the compiler do this is risky, since standard compilers don't do it in the specific way we are doing here: we are using immintrin.h intrinsics to load data and execute AVX instructions. Also, this patch closes the gap that if you compile numpy for AVX2 (or AVX-512) today, say with -march=native, you get the SSE code for the simd functions even though the rest of the code gets AVX2. If the overlap checks have not been adapted, I will be happy to fix them. Since users are always looking for better performance in numpy applications, I would recommend leaving this in for the next release and gathering feedback from users. It is not broken and works fine, since Clear Linux uses the same approach; objdump -d will show it. I will upload some numbers soon (https://github.com/clearlinux-pkgs/numpy)
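The immintrin.h pattern being described is, roughly, explicit wide loads and stores. A hedged sketch (guarded so it falls back to scalar code when AVX is not enabled at compile time; illustrative, not numpy's actual loop):

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Illustrative immintrin.h usage: process 8 floats per iteration with
 * 256-bit unaligned loads/stores when compiled with AVX enabled,
 * otherwise (and for the tail) fall back to scalar code. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#ifdef __AVX__
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);               /* YMM load */
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb)); /* YMM store */
    }
#endif
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```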

@juliantaylor
Contributor

juliantaylor commented May 3, 2018

You are not doing anything special here; this is the most trivial of vectorized code, and compilers have been able to do it for a very long time. At the very least, GCC produces equivalent machine code.
I have tested quite a bit, but it is still slightly slower than the SSE code on all machines I have tested (most AVX CPUs up to and including Haswell, and AMD equivalents), though I haven't gotten access to a Skylake yet.
The alignment checks are wrong: you are not handling overlap checks for the larger vectors. It is just not obvious how to trigger the overlap case here; you might need to use an accumulate operation.
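The hazard being pointed out scales with the vector width: an input/output gap of, say, 24 bytes passes a check written for 16-byte SSE stores, yet 32-byte AVX stores would still clobber unread input. An illustrative predicate (hypothetical, not numpy's actual check):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative overlap test: exact aliasing (dst == src, in-place) is
 * fine for elementwise ops, but a partial overlap smaller than the
 * vector width means a wide store touches elements the loop still
 * needs to read.  The check must therefore use the widest vector
 * size compiled in, not a hard-coded 16 bytes. */
static bool partial_overlap(const char *dst, const char *src, size_t vecbytes)
{
    ptrdiff_t gap = dst > src ? dst - src : src - dst;
    return gap != 0 && (size_t)gap < vecbytes;
}
```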

juliantaylor added a commit to juliantaylor/numpy that referenced this pull request May 17, 2018
@juliantaylor juliantaylor mentioned this pull request May 17, 2018