AVX for floats? #11113
Conversation
see gh-10251
I agree that leaving these things to the compilers is the way to go when it can be made to work. @VictorRodriguez Thoughts?
a) Compilers do not work that way; we are using intrinsics that improve performance: https://github.com/VictorRodriguez/operating-systems-lecture/tree/master/labs/gcc/avx_training

b) We are working on the same basis as proposed by this patch, where it said: "This file is for the definitions of SIMD vectorized operations.. Currently contains sse2 functions that are built on amd64, x32 or non-generic builds (CFLAGS=-march=...) In future it may contain other instruction sets like AVX or NEON ..." My patch is an extension of what was proposed before in simd.inc.src.

c) The performance is hard to measure with micro-benchmarks; there is a proper numpy benchmark: https://openbenchmarking.org/test/pts/numpy. The micro-benchmarks that you propose are hard to use as a reference for this change.

The patch is a good extension of what was done before. I have been testing numpy for a while and the functional tests pass on AVX/AVX-512 platforms. If it creates memory overlap and you consider that it might generate problems in some accumulate cases, I am more than happy to fix them.
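For context, here is a minimal sketch of the kind of hand-written immintrin.h kernel being discussed (the function name is hypothetical; this is not the actual code from simd.inc.src or the linked training repo, and it needs -mavx to compile):

```c
#include <immintrin.h>
#include <stddef.h>

/* Illustrative AVX kernel: add two float arrays 8 elements at a time.
 * Assumes n is a multiple of 8 and that the arrays do not overlap. */
void add_float_avx(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
}
```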
I am all for adding more modern ISAs to numpy, but it has to be in places where it makes sense. The basic math functions unfortunately do not make sense in numpy's current design.
Sorry for the delay in answering:

[compiler output not preserved in this thread]

and with -O3 -march=haswell:

[compiler output not preserved in this thread]

As you can see, the generated code is much more optimized when we use immintrin.h.
-> Under what cases? I have not seen that in the numpy I have built with this approach.
"I am all for adding more modern is to numpy, but it has to be in places it makes sense. The basic math functions, unfortunately, do not make sense ins numpy current design. Ok so I am in, please guide me what areas are suitable for the proper use of AVX instructions in Numpy. I am very interested to provide the community a better and faster Numpy that enhance the HW they have as much as possible, in the meantime please do not revert the change until we have the other sections. If we have solid numbers measured that shows my patch decrease performance I will be the one that makes revert to it. In the meantime I am more than happy to work on the other areas you mention, please point me at what part of numpy code the SIMD instructions are a good candidate. |
When I wrote this code aeons ago I assumed avx would provide a performance benefit; unfortunately that could never be achieved, else I would have added it myself. What version of gcc are you using to compile your testcase? gcc 7 produces this perfectly fine code:

[generated assembly not preserved in this thread]
gcc-5 produces the same code, and I am very sure you get similar code even when going back to around version 4.6-4.7.

The reason numpy cannot effectively use the L1 cache is that it has huge overheads in the most basic operations; the code we are talking about here is executed via the very generic universal function infrastructure. This infrastructure (and python) adds huge amounts of overhead that, besides polluting the cache between operations, overshadows the computational cost of the L1 operation. ufunc overhead is around 60-65% of the computing cost for arrays which fit into the L1 cache, so many micro-optimizations squeezing out a cycle here and there will have almost no effect. In order to use AVX effectively for these operations we would need some form of operation merging/lazy evaluation, as for example numexpr does.

As I do consistently see a small slowdown using avx compared to sse in these functions, and as it is only enabled at compile time and will therefore see very little testing, I still favor reverting this code (and not adding my variant). Though I'll leave that decision to the more active developers. The probable overlap issue can simply be fixed by replacing the 16 byte distance checks with 32/64 byte checks at compile time (see the sketch below).

An area which comes to mind that would profit is convolutions: the 1d convolutions are, next to the offloaded matrix multiplications, the densest computation type in numpy, so a natural fit for vectorization. For something simpler, the boolean operations in the same simd.inc.src file would probably benefit from an avx2 implementation.
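To make the suggested overlap fix concrete, here is a minimal sketch of a distance check whose width is chosen at compile time (the macro and function names are hypothetical, not the actual simd.inc.src code):

```c
#include <stdint.h>

/* Hypothetical vector width selection; a real build would pick this up
 * from its own configuration rather than raw compiler macros. */
#if defined(__AVX512F__)
#define VEC_SIZE_BYTES 64
#elif defined(__AVX__)
#define VEC_SIZE_BYTES 32
#else
#define VEC_SIZE_BYTES 16
#endif

/* Nonzero when a VEC_SIZE_BYTES-wide load/store pair cannot partially
 * overlap: the pointers are either identical (a pure in-place operation)
 * or at least one full vector apart. */
static int
no_partial_overlap(const char *dst, const char *src)
{
    uintptr_t a = (uintptr_t)dst, b = (uintptr_t)src;
    uintptr_t d = a > b ? a - b : b - a;
    return d == 0 || d >= VEC_SIZE_BYTES;
}
```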
I am using GCC 8, but GCC 7 might also give the same result on the example code I posted before. What code are you compiling and which flags were used? I really appreciate you sharing the branch and code for vect-convolve; I will take the AR to merge it first as SSE and then as AVX. As soon as I have a functional and well-measured pull request I will add you as a reviewer. Thanks a lot for the guidance and help.
This is more change than I would like in a minor release, so pushing the milestone back to 1.16. |
The point is more about reverting the change if no one can show it's actually faster, or fixing the overlap bug.
@juliantaylor The overlap bug is on my TODO list (just delayed by work stuff). 1.15 has just been released; can we wait a bit to see how the community reacts and whether it proves faster for them?
Will this see work before the 1.16 feature freeze?
Going to push this off. I suggest that the release note be omitted until we settle on whether we can do this or not. @VictorRodriguez What is the status of the overlap bug?
@charris I have a WIP patch, should I send it out? It is not working yet.
@VictorRodriguez Might put it up for discussion, with a link to this. It certainly won't make it into 1.16.
Here is a fix for the overlap issue: #12398
Ping -- this has the 1.17 milestone -- is it realistically going to be ready for that?
@tylerjereddy could you please remind me what the goal is that needs to be met before the 1.17 milestone? I had the impression this was fixed with the overlap fix commit. Thanks.
@VictorRodriguez Are you saying this PR isn't needed anymore? I'm just probing here to see what is going on given ~6 months since the last comment. If we don't need the PR that's even easier.
@tylerjereddy this PR removes my patch for AVX support in numpy; @juliantaylor's point is that the compiler can do that job and it is not necessary to do it with intrinsics as in my patch. The only arguable thing would be to provide numbers with and without the patch.
Going to push this off to 1.18, the "cleanup" release. I like the idea of leaving the optimizations to the compilers, and the compilers are getting smarter. @juliantaylor Might take a look at what other optimizations we might want to leave to the compilers; the simpler we can keep our code, the better.
I checked the binary code generated when the AVX512/AVX2 intrinsics are reverted; it differs from the code generated with the intrinsics. The experiment was done with NumPy 1.17.2 and GCC 9.2.1 (gcc version 9.2.1 20190913, gcc-9-branch@275694). Let's take the sse2_binary_scalar2_divide_DOUBLE function as an example. Without the intrinsics (i.e. with Victor's intrinsic implementation reverted), the GCC-generated binary code (with -march=skylake-avx512) still uses xmm registers for the main-block divide, while with Victor's patch the binary code uses zmm-based AVX512 division for the main block.

```
00000000001e0bc0 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0>:
1e0bc0: c5 fb 10 0a vmovsd xmm1,QWORD PTR [rdx]
1e0bc4: 49 89 f8 mov r8,rdi
1e0bc7: c5 fb 12 d1 vmovddup xmm2,xmm1
1e0bcb: 41 83 e0 0f and r8d,0xf
1e0bcf: 0f 85 1b 02 00 00 jne 1e0df0 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x230>
1e0bd5: 49 89 c8 mov r8,rcx
1e0bd8: 49 89 f1 mov r9,rsi
1e0bdb: 31 c0 xor eax,eax
1e0bdd: 49 83 e0 fe and r8,0xfffffffffffffffe
1e0be1: 41 80 e1 0f and r9b,0xf
1e0be5: 0f 84 42 02 00 00 je 1e0e2d <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x26d>
1e0beb: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
1e0bf0: 4c 39 c0 cmp rax,r8
1e0bf3: 7d 1a jge 1e0c0f <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x4f>
1e0bf5: 0f 1f 00 nop DWORD PTR [rax]
1e0bf8: c5 f9 10 2c c6 vmovupd xmm5,XMMWORD PTR [rsi+rax*8]
1e0bfd: c5 d1 5e c2 vdivpd xmm0,xmm5,xmm2
1e0c01: c5 f8 29 04 c7 vmovaps XMMWORD PTR [rdi+rax*8],xmm0
1e0c06: 48 83 c0 02 add rax,0x2
1e0c0a: 49 39 c0 cmp r8,rax
1e0c0d: 7f e9 jg 1e0bf8 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x38>
...
```

With Victor's patch (which is also the default numpy build):

```
00000000001e0c40 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0>:
1e0c40: c5 fb 10 12 vmovsd xmm2,QWORD PTR [rdx]
1e0c44: 49 89 f8 mov r8,rdi
1e0c47: 62 f2 fd 48 19 ca vbroadcastsd zmm1,xmm2
1e0c4d: 41 83 e0 3f and r8d,0x3f
1e0c51: 0f 85 09 02 00 00 jne 1e0e60 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x220>
1e0c57: 49 89 c8 mov r8,rcx
1e0c5a: 49 89 f1 mov r9,rsi
1e0c5d: 31 c0 xor eax,eax
1e0c5f: 49 83 e0 f8 and r8,0xfffffffffffffff8
1e0c63: 41 83 e1 3f and r9d,0x3f
1e0c67: 0f 84 53 02 00 00 je 1e0ec0 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x280>
1e0c6d: 0f 1f 00 nop DWORD PTR [rax]
1e0c70: 4c 39 c0 cmp rax,r8
1e0c73: 7d 20 jge 1e0c95 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x55>
1e0c75: 0f 1f 00 nop DWORD PTR [rax]
1e0c78: 62 f1 fd 48 10 2c c6 vmovupd zmm5,ZMMWORD PTR [rsi+rax*8]
1e0c7f: 62 f1 d5 48 5e c1 vdivpd zmm0,zmm5,zmm1
1e0c85: 62 f1 fd 48 29 04 c7 vmovapd ZMMWORD PTR [rdi+rax*8],zmm0
1e0c8c: 48 83 c0 08 add rax,0x8
1e0c90: 49 39 c0 cmp r8,rax
1e0c93: 7f e3 jg 1e0c78 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x38>
...
```

With the latency data from Intel's instruction manual, the main loop latency of the AVX512 intrinsic version is 8+23+5 = 36 cycles, for 8 doubles:

```
1e0c78: 62 f1 fd 48 10 2c c6 vmovupd zmm5,ZMMWORD PTR [rsi+rax*8]
1e0c7f: 62 f1 d5 48 5e c1 vdivpd zmm0,zmm5,zmm1
1e0c85: 62 f1 fd 48 29 04 c7 vmovapd ZMMWORD PTR [rdi+rax*8],zmm0
```

The reverted version uses xmm, with a main loop latency of 6+14+5 = 25 cycles, for 2 doubles:

```
1e0bf8: c5 f9 10 2c c6 vmovupd xmm5,XMMWORD PTR [rsi+rax*8]
1e0bfd: c5 d1 5e c2 vdivpd xmm0,xmm5,xmm2
1e0c01: c5 f8 29 04 c7 vmovaps XMMWORD PTR [rdi+rax*8],xmm0
```

The AVX2 version with the intrinsics is 7+14+5 = 26 cycles for 4 doubles (I don't paste the binary code here). So the AVX512 intrinsic version costs roughly 4.5 cycles per double, AVX2 about 6.5 cycles per double, and the SSE version about 12.5 cycles per double. Based on this, I think Victor's intrinsic patch is still worth keeping in numpy, as it beats GCC's default generation for AVX512/AVX2. I attached the full binary code of this function for the four cases (AVX512_reverted, AVX512_intrinsic, AVX2_reverted, AVX2_intrinsic), dumped with objdump from _multiarray_umath.cpython-37m-x86_64-linux-gnu.so. The full objdump file is too big to upload; ping me if you want it and I can send it by email.
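For reference, here is a rough C sketch (hypothetical names, not numpy's actual source) of what the zmm main loop above corresponds to at the intrinsics level; it requires -mavx512f:

```c
#include <immintrin.h>
#include <stddef.h>

/* Illustrative only: divide each double in src by the scalar *val,
 * 8 elements per iteration, matching the zmm loop in the dump above.
 * Assumes n is a multiple of 8 and dst is 64-byte aligned. */
void divide_double_scalar_avx512(double *dst, const double *src,
                                 const double *val, size_t n)
{
    __m512d vscalar = _mm512_set1_pd(*val);                   /* vbroadcastsd zmm1 */
    for (size_t i = 0; i < n; i += 8) {
        __m512d v = _mm512_loadu_pd(src + i);                 /* vmovupd zmm5 */
        _mm512_store_pd(dst + i, _mm512_div_pd(v, vscalar));  /* vdivpd + vmovapd */
    }
}
```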
I'm going to push this off again. It needs a rebase, and after the optimization infrastructure patches go in might be a better time to revisit it.
@charris I guess by 'the optimization infrastructure patches' you mean PR #13516. And yes, I agree that with PR #13516 in, the intrinsic code related to this topic needs to be revisited to use the generic intrinsic wrappers defined by that PR.
Thanks! Too bad we never put this in. But by now we do have AVX on, I think, all of these, added by @r-devulap. I am not 100% sure that covers everything, but I think it covers enough to close the PR.
We currently have some avx specializations for float functions in our code, for when numpy is compiled specifically for these instruction sets. They currently do not correctly account for memory overlap, so there might be problems in some accumulate cases. This is easy to fix, but my concern is that it is a bunch of code we probably don't really need, for two reasons:

1. Compilers can vectorize these trivial functions themselves.
2. I have never been able to produce a speedup from using avx on a pretty large range of different Intel and AMD cpus (pretty much everything up to and including haswell/ryzen, but not skylake or threadripper, on xeons and iX). This is not that surprising, as in these simple functions we are memory bound and numpy overheads are too large to make use of more than the L2 cache. The two functions which could be cpu bound, division and square root, are both implemented as sequential 128-bit operations on Intel hardware (at least up to haswell), so no significant speedup is to be expected there either.
Here is an alternative that just leaves the vectorization to the compiler; as currently implemented this still requires GCC. The code generation is perfectly fine, though there are differences depending on whether you tune (-mtune) the code for haswell (the default in gcc >= 8) or for older cpus. The performance in my testing is pretty much identical for our code in either case.
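As a rough illustration of the compiler-driven approach (not the actual code in this branch), a trivial loop like the following is auto-vectorized by GCC at -O3 and retargeted simply by changing -march/-mtune:

```c
#include <stddef.h>

/* GCC's auto-vectorizer handles this loop on its own: with -O3
 * -march=haswell it typically emits 256-bit AVX code, while with -O3 and
 * the default SSE2 baseline it emits 128-bit code. 'restrict' promises
 * the compiler that the arrays cannot overlap. */
void scalar_multiply(double *restrict dst, const double *restrict src,
                     double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * scalar;
    }
}
```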
I do not propose to merge this PR, but I want to post it for others to test; maybe I just happened to use some poorly performing hardware every time.
Unfortunately I seem to have lost my old benchmarks, though this small code represents the best possible case for this vectorization: being an L2-cache-sized scalar in-place operation, it implies the minimal amount of memory traffic.
On my current machines, an AMD Ryzen AVX desktop and a Haswell i5 laptop, it is about 5% slower than the original SSE2 code.
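The original benchmark snippet is not preserved in this thread; a standalone C analogue of the measurement described (an in-place scalar operation on an L2-cache-sized array, repeated many times) could look like the following, where the array size and repetition count are arbitrary choices:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    (32 * 1024)   /* 32k doubles = 256 KiB, roughly L2-sized */
#define REPS 100000

int main(void)
{
    double *a = malloc(N * sizeof(double));
    for (size_t i = 0; i < N; i++) {
        a[i] = 1.0 + 1e-9 * (double)i;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++) {
        for (size_t i = 0; i < N; i++) {   /* in-place scalar multiply */
            a[i] *= 1.0000001;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                  + 1e-9 * (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.3f ns per element (checksum %g)\n",
           1e9 * secs / ((double)N * REPS), a[0]);
    free(a);
    return 0;
}
```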
Please try it out, maybe also with other options like
OPT="-DNDEBUG -mavx2 -O2"
to activate AVX for float on master, and let me know if you see any speedup. If nobody can reproduce benefits I recommend reverting the avx/avx512 code, as it adds a lot of code that is only activated by compiler flags and thus gets very little usage and testing.