ENH: Enable AVX2/AVX512 support to numpy #10251


Merged
merged 1 commit into from
Dec 26, 2017
Conversation

VictorRodriguez
Contributor

This patch enables AVX2/AVX-512F instructions and distutils flags to
maximise the use of IA technology such as Haswell and Skylake platforms
on math functions of numpy

Compiled with :
python3 setup.py build -b py3 --fcompiler=gnu95

Examples of the results generated are:

mtrand.cpython-36m-x86_64-linux-gnu.so
mtrand.cpython-36m-x86_64-linux-gnu.so.avx2
mtrand.cpython-36m-x86_64-linux-gnu.so.avx512

These make proper use of ZMM / YMM registers and FMA instructions.

Signed-off-by: Arjan van de Ven arjan@linux.intel.com
Signed-off-by: William Douglas william.douglas@intel.com
Signed-off-by: Victor Rodriguez victor.rodriguez.bahena@intel.com

@njsmith
Member

njsmith commented Dec 20, 2017 via email

@ghost

ghost commented Dec 20, 2017

It would be better to check AVX at runtime (checking it on import would be negligible) and then dynamically adjust the code path.
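A minimal sketch of such a runtime check, assuming a GCC/Clang toolchain (`__builtin_cpu_supports` reads CPUID; the function name here is illustrative, not numpy's):

```c
#include <stddef.h>

/* Hypothetical dispatch sketch: query the CPU once and pick a code
 * path.  __builtin_cpu_supports caches the CPUID result, so doing
 * this at module import time costs effectively nothing. */
static const char *pick_simd_path(void)
{
    if (__builtin_cpu_supports("avx512f"))
        return "avx512";
    if (__builtin_cpu_supports("avx2"))
        return "avx2";
    return "sse2";
}
```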

@fenrus75

There are two separate things in this PR (which is a bit confusing):

  1. Create multiple .so's; this part of the PR is incomplete, so let's pretend it's not there for now (hoping Victor can update to just the code without that change).

  2. Change simd.inc.src to use AVX2 or AVX512 at compile time. This closes the gap that if you compile numpy for AVX2 (or AVX-512) today, say with -march=native, you get the SSE code for the simd functions even though the rest of the code gets AVX2. Yes, ideally this would be a runtime thing, but that is a more complex change, and could easily be a follow-on to this initial "make the code be there" change.
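Point 2 comes down to compile-time selection: GCC defines `__AVX2__` when building with `-mavx2` (or `-march=native` on a capable machine). A hedged sketch of that pattern (the macro and function names are illustrative, not the actual simd.inc.src code):

```c
#include <stddef.h>

/* Illustrative compile-time width selection: when numpy is built with
 * -mavx2, __AVX2__ is defined and the wider loop body is compiled in;
 * otherwise the SSE-width body remains. */
#ifdef __AVX2__
#define VECTOR_WIDTH 8   /* 8 floats per 256-bit YMM register */
#else
#define VECTOR_WIDTH 4   /* 4 floats per 128-bit XMM register */
#endif

void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    /* main loop steps in VECTOR_WIDTH chunks; a real implementation
     * would use intrinsics inside this loop */
    for (; i + VECTOR_WIDTH <= n; i += VECTOR_WIDTH)
        for (size_t j = 0; j < VECTOR_WIDTH; j++)
            dst[i + j] = a[i + j] + b[i + j];
    for (; i < n; i++)          /* scalar tail */
        dst[i] = a[i] + b[i];
}
```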

In terms of performance, for very basic vectorized math, AVX2 is a theoretical 2x increase over SSE, and AVX512 is another theoretical 2x. For code where multiply operations can be fused with add operations, AVX2 has an additional theoretical 2x to gain over SSE.

Theoretical gains are not realized gains, but for basic functions like those in simd.inc.src, it's pretty typical to get 90% or more of theoretical (depending a bit on how big the arrays are).

@ghost

ghost commented Dec 21, 2017

If this PR is restricted to (2), then IMO it's mergeable. It would still be a nice enhancement to add runtime checks, though, because the wheel package format doesn't allow specifying a processor architecture, meaning that most people probably won't run this code.

@ghost

ghost commented Dec 21, 2017

@VictorRodriguez It looks like you got hit with MSVC 2008. Just wrap your code like this:

```c
#if !defined(_MSC_VER) || _MSC_VER >= 1600
/* AVX code here */
#endif
```

@VictorRodriguez
Contributor Author

@xoviat thanks a lot for the help; after multiple experiments, it is passing all the tests.
OK to merge?

@ghost

ghost left a comment

LGTM.

@charris
Member

charris commented Dec 23, 2017

@VictorRodriguez Could you add an entry to the 1.15 release notes under "Improvements"?

@charris charris added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Dec 23, 2017
@VictorRodriguez
Contributor Author

VictorRodriguez commented Dec 26, 2017

@charris I updated the PR with a description under doc/release/1.15.0-notes.rst, is that OK?

@charris charris merged commit 0385247 into numpy:master Dec 26, 2017
@charris
Member

charris commented Dec 26, 2017

Thanks @VictorRodriguez .

@juliantaylor
Contributor

juliantaylor commented Mar 31, 2018

On what type of CPU do you see 90% of the theoretical gains?
I have never seen more than 5% for this type of code on a pretty large range of Intel CPUs (Xeon, and i5/i7), so I never bothered.
The code can also be simplified to ditch the unaligned paths; AVX doesn't really care about alignment since Haswell.

@juliantaylor
Contributor

Also, this is broken, as the overlap checks have not been adapted for the larger vector sizes.
The better approach for AVX on these simple things is probably to let the compiler do it, as we do for the integers.
It was done manually for SSE2 because compilers were a lot worse back then.
Compilers that support AVX can vectorize this stuff themselves.
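The comparison here is with loops the compiler vectorizes on its own. A plain loop like this sketch, compiled with e.g. `gcc -O3 -mavx2`, is turned into YMM code without any intrinsics (the flags and inspection options are the usual GCC ones, not something this PR adds):

```c
#include <stddef.h>

/* A loop simple enough for GCC/Clang auto-vectorization: built with
 * -O3 -mavx2 the compiler emits 256-bit vmulpd on its own, which can
 * be confirmed with -fopt-info-vec or objdump -d. */
void multiply_arrays(double *dst, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * b[i];
}
```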

@charris
Member

charris commented Apr 13, 2018

@juliantaylor Is the overlap problem fixed by your recent PR?

@VictorRodriguez
Contributor Author

@juliantaylor thanks a lot for your feedback. Letting the compiler do this is risky, since standard compilers don't do it in the specific way we are doing here: we are using immintrin.h intrinsics to load data and execute AVX instructions. Also, this patch closes the gap that if you compile numpy for AVX2 (or AVX-512) today, say with -march=native, you get the SSE code for the simd functions even though the rest of the code gets AVX2. If the overlap checks have not been adapted, I will be happy to fix them. Since users are always looking for better performance in numpy applications, I would recommend leaving this in for the next release and gathering feedback from users. It is not broken and works fine, since Clear Linux uses the same approach; objdump -d will show it. I will upload some numbers soon (https://github.com/clearlinux-pkgs/numpy)
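The immintrin.h pattern being described is, roughly, explicit wide loads and stores. A hedged sketch (guarded so it falls back to scalar code when AVX is not enabled at compile time; illustrative, not numpy's actual loop):

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Illustrative immintrin.h usage: process 8 floats per iteration with
 * 256-bit unaligned loads/stores when compiled with AVX enabled,
 * otherwise (and for the tail) fall back to scalar code. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#ifdef __AVX__
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);               /* YMM load */
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb)); /* YMM store */
    }
#endif
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```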

@juliantaylor
Contributor

juliantaylor commented May 3, 2018

You are not doing anything special here; this is the most trivial of vectorized code, and compilers have been able to do it for a very long time. At the very least, GCC produces equivalent machine code.
I have tested quite a bit, but it is still slightly slower than the SSE code on all machines I have tested (most AVX CPUs up to and including Haswell, and AMD equivalents), though I haven't gotten access to a Skylake yet.
The alignment checks are wrong: you are not handling overlap checks for the larger vectors. It is just not obvious how to trigger the overlap case here; you might need to use an accumulate operation.
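The hazard being pointed out scales with the vector width: an input/output gap of, say, 24 bytes passes a check written for 16-byte SSE stores, yet 32-byte AVX stores would still clobber unread input. An illustrative predicate (hypothetical, not numpy's actual check):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative overlap test: exact aliasing (dst == src, in-place) is
 * fine for elementwise ops, but a partial overlap smaller than the
 * vector width means a wide store touches elements the loop still
 * needs to read.  The check must therefore use the widest vector
 * size compiled in, not a hard-coded 16 bytes. */
static bool partial_overlap(const char *dst, const char *src, size_t vecbytes)
{
    ptrdiff_t gap = dst > src ? dst - src : src - dst;
    return gap != 0 && (size_t)gap < vecbytes;
}
```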

juliantaylor added a commit to juliantaylor/numpy that referenced this pull request May 17, 2018
@juliantaylor juliantaylor mentioned this pull request May 17, 2018