
AVX for floats? #11113


Closed
wants to merge 2 commits into from

Conversation

@juliantaylor (Contributor) commented May 17, 2018

We currently have some avx specializations for float functions in our code, used when numpy is compiled specifically for these instruction sets. They currently do not correctly account for memory overlap, so there might be problems in some accumulate cases. That is easy to fix, but my concern is that this is a bunch of code we probably don't really need, for two reasons:

Compilers can vectorize these trivial functions themselves, and I have never been able to produce a speedup from using avx on a pretty large range of different intel and amd cpus (pretty much everything up to and including haswell/ryzen, but not skylake or threadripper, on xeons and iX). This is not that surprising, as in these simple functions we are memory bound and numpy's overheads are too large to make use of more than the L2 cache. The two functions which could be cpu bound, division and square root, are both implemented as sequential 128 bit operations on intel (at least up to haswell), so no significant speedup is to be expected there either.

Here is an alternative that just leaves the vectorization to the compiler; as currently implemented it still requires GCC. The code generation is perfectly fine, though there are differences depending on whether you tune (-mtune) the code for haswell (the default in gcc >= 8) or for older cpus. In my testing the performance of our code is pretty much identical in either case.

I do not propose to merge this PR, but I want to post it for others to test; maybe I just happened to use poorly performing hardware every time.
Unfortunately I seem to have lost my old benchmarks, but the small snippet below represents the best possible case for this vectorization: being an L2-cache-sized scalar in-place operation, it implies the minimal amount of memory traffic.
On my current machines, an amd ryzen AVX desktop and a haswell i5 laptop, it is about 5% slower than the original SSE2 code.
Please try it out, maybe also with other options like OPT="-DNDEBUG -mavx2 -O2" to activate avx for floats on master, and let me know whether you see any speedup.

If nobody can reproduce any benefit, I recommend reverting the avx/avx512 code, as it adds a lot of code that is only activated by compiler flags and thus gets very little usage and testing.

import numpy as np

# ~40 KiB of float32: fits in the L2 cache but not in L1
d = np.ones(10000, dtype=np.float32)

# scalar in-place operations: minimal memory traffic, best case for the SIMD loops
%timeit np.add(d, 0, out=d)
%timeit np.multiply(d, 1., out=d)
%timeit np.divide(d, 1., out=d)

@juliantaylor (Contributor Author)

see gh-10251

@charris (Member) commented May 17, 2018

I agree that leaving these things to the compilers is the way to go when it can be made to work.

@VictorRodriguez Thoughts?

@VictorRodriguez (Contributor) commented May 19, 2018

a) Compilers do not work that way; we are using intrinsics that improve performance by themselves. Please take a look at this simple example in C that shows how immintrin.h actually gives better results:

https://github.com/VictorRodriguez/operating-systems-lecture/tree/master/labs/gcc/avx_training

b) We are working on the same basis as proposed by this commit:

6312353

which says: "This file is for the definitions of SIMD vectorized operations.. Currently contains sse2 functions that are built on amd64, x32 or non-generic builds (CFLAGS=-march=...) In future it may contain other instruction sets like AVX or NEON ..." My patch is an extension of what was proposed before in simd.inc.src.

c) Performance is hard to measure with micro-benchmarks; there is a proper numpy benchmark:

https://openbenchmarking.org/test/pts/numpy

The micro-benchmark that you propose is hard to use as a reference for this change.

The patch is a good extension of what was done before. I have been testing numpy for a while and the functional tests pass on AVX/AVX-512 platforms. If it mishandles memory overlap and you consider that it might cause problems in some accumulate cases, I am more than happy to fix that.

@juliantaylor (Contributor Author) commented May 21, 2018

  1. compilers very much work that way; this code is not difficult to vectorize. Compilers cannot deal with everything, which is why numpy has a bunch of handwritten intrinsic code, but this code is not one of those cases (see the sketch after this list).
  2. you are comparing float and integer code; avx2 is beneficial for integers because sse2 does not provide good integer vectorization. numpy has already used avx2 for integers for a few releases.
  3. despite the examples being incomparable in terms of code generation, they are still equivalent in performance on my amd ryzen (both 3 seconds, which is surprisingly fast compared to what you posted; what kind of machine was that on?).
    The compiler's code generation is also perfectly fine. If you want unrolling, add -funroll-loops or the equivalent pragma. But unrolling is not useful for this type of code on current cpus; it provides no benefit, so the compiler does not do it by default.
  4. in your third example the compiler clearly optimized the code away, otherwise a 0s result is impossible
  5. due to its overheads numpy cannot effectively use the L1 cache, so most micro-optimizations at the assembly level are masked by memory latency.
  6. micro-benchmarks are useful for this code; if you cannot show a speedup in the best possible scenario, a macro benchmark will not change the picture either. The only reason it might change is skipping the vzeroupper that some intel chips require to avoid the penalty of switching between sse and avx contexts, but I have not been able to produce a measurable difference from this.
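For illustration, here is a minimal sketch (not NumPy's actual inner loop, the function name is made up) of the kind of function we are talking about. Compiled with e.g. gcc -O3 -mavx2, GCC auto-vectorizes the loop into 256-bit vmulps without any intrinsics:

#include <stddef.h>

/* Trivially vectorizable scalar operation, the shape of the loops at issue. */
void multiply_scalar_f32(float *out, const float *in, float scalar, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * scalar;   /* auto-vectorized by the compiler at -O3 */
    }
}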

I am all for adding more modern isas to numpy, but it has to be in places where it makes sense. The basic math functions unfortunately do not make sense in numpy's current design.
Places where it does make sense are, for example, the boolean functions, complex math, convolutions and accumulative operations. If you want to make the largest impact, please work on those areas. I can gladly provide support there.

@VictorRodriguez (Contributor) commented May 28, 2018

Hi @juliantaylor

Sorry for the delay in answering:

  1. "but this code is not one of these case" I wonder why in your code documentation mention this : "This file is for the definitions of SIMD vectorized operations.. Currently contains sse2 functions that are built on amd64, x32 or non-generic builds (CFLAGS=-march=...) In future it may contain other instruction sets like AVX or NEON ..." I was guided by your coments and code sample.

  2. "in your third example the compiler clearly optimized the code away, otherwise, a 0s result is impossible"
    -> the objdump -d generated for my example code with immintrin.h:

0000000000400560 <foo>:
  400560:       31 c0                   xor    %eax,%eax
  400562:       c5 fc 28 88 60 14 60    vmovaps 0x601460(%rax),%ymm1
  400569:       00
  40056a:       c5 f4 58 80 60 10 60    vaddps 0x601060(%rax),%ymm1,%ymm0
  400571:       00
  400572:       48 83 c0 20             add    $0x20,%rax
  400576:       c5 fc 11 80 40 18 60    vmovups %ymm0,0x601840(%rax)
  40057d:       00
  40057e:       48 3d 00 04 00 00       cmp    $0x400,%rax
  400584:       75 dc                   jne    400562 <foo+0x2>
  400586:       c5 f9 57 c0             vxorpd %xmm0,%xmm0,%xmm0
  40058a:       bf 24 06 40 00          mov    $0x400624,%edi
  40058f:       b0 01                   mov    $0x1,%al
  400591:       c5 fa 5a 05 c3 16 20    vcvtss2sd 0x2016c3(%rip),%xmm0,%xmm0        # 601c5c <a+0x3fc>
  400598:       00
  400599:       c5 f8 77                vzeroupper
  40059c:       e9 7f fe ff ff          jmpq   400420 <printf@plt>
  4005a1:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4005a8:       00 00 00
  4005ab:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

and with -O3 -march=haswell

0000000000400560 <foo>:
  400560:       31 d2                   xor    %edx,%edx
  400562:       8b 8a 60 14 60 00       mov    0x601460(%rdx),%ecx
  400568:       b8 00 e1 f5 05          mov    $0x5f5e100,%eax
  40056d:       03 8a 60 10 60 00       add    0x601060(%rdx),%ecx
  400573:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400578:       ff c8                   dec    %eax
  40057a:       75 fc                   jne    400578 <foo+0x18>
  40057c:       89 8a 60 18 60 00       mov    %ecx,0x601860(%rdx)
  400582:       48 83 c2 04             add    $0x4,%rdx
  400586:       48 81 fa 00 04 00 00    cmp    $0x400,%rdx
  40058d:       75 d3                   jne    400562 <foo+0x2>
  40058f:       8b 35 cb 12 20 00       mov    0x2012cb(%rip),%esi        # 601860 <a>
  400595:       bf 24 06 40 00          mov    $0x400624,%edi
  40059a:       31 c0                   xor    %eax,%eax
  40059c:       e9 7f fe ff ff          jmpq   400420 <printf@plt>
  4005a1:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4005a8:       00 00 00
  4005ab:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

As you can see, the code is much more optimized when we use immintrin.h.

  3. "due to overheads numpy cannot effectively use the L1 cache so most micro optimizations at assembly level are masked by memory latency"

→ in which cases? I have not seen that in the numpy builds I have made with this approach.

  1. "micro benchmarks are useful for this code, if you cannot show a speedup in the best possible scenario a macro benchmark will also not change" There might be some outline cases that the microbenchmark is not considering, min / max and standard deviation are factor to consider for benchmarking. But I am free to make a script that take these into consideration with your microbenchmarks.

"I am all for adding more modern is to numpy, but it has to be in places it makes sense. The basic math functions, unfortunately, do not make sense ins numpy current design.
Places, where it makes sense, are for example the boolean functions, complex math, convolutions and accumulative operations. If you want to make the largest impact please work on these areas. I gladly can provide support ion those areas."

Ok, so I am in; please guide me as to which areas are suitable for the proper use of AVX instructions in Numpy. I am very interested in giving the community a better and faster Numpy that exploits the HW they have as much as possible. In the meantime, please do not revert the change until we have the other sections. If we get solid measured numbers showing that my patch decreases performance, I will be the one to revert it. Meanwhile I am more than happy to work on the other areas you mention; please point me at the parts of the numpy code where SIMD instructions are a good candidate.

@juliantaylor (Contributor Author) commented May 28, 2018

When I wrote this code aeons ago I assumed avx would provide a performance benefit; unfortunately that could never be demonstrated, or I would have added it myself.
When the code was written, compilers were also still somewhat worse at vectorizing this; while the mainstream linux compilers could do it, mac (old clangs or old gccs) and windows compilers (msvc) were not up to the task. So we manually vectorized the code.
Today we could probably replace it with compiler-vectorized code, though it's probably not worth the effort of testing all the platforms again (in particular, windows python2 is probably still in the same situation as back then).

What version of gcc are you using to compile your test case? gcc 7 produces this perfectly fine code:

 6f8:   c5 fd 6f 04 06          vmovdqa (%rsi,%rax,1),%ymm0
 6fd:   c5 fd fe 04 01          vpaddd (%rcx,%rax,1),%ymm0,%ymm0
 702:   c5 fd 7f 04 02          vmovdqa %ymm0,(%rdx,%rax,1)
 707:   48 83 c0 20             add    $0x20,%rax
 70b:   48 3d 00 04 00 00       cmp    $0x400,%rax
 711:   75 e5                   jne    6f8 <foo+0x28>

gcc-5 produces the same code, and I am fairly sure you get similar code even going back to around version 4.6-4.7.
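For reference, a hypothetical reconstruction of a test case of this shape (the exact source in the linked repo may differ): three global int arrays and a plain element-wise add. Compiled with gcc -O3 -mavx2 and no immintrin.h, recent GCC versions emit a vmovdqa/vpaddd loop like the one shown above.

#define N 256

static int a[N], b[N], c[N];

void foo(void)
{
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];   /* vectorized to vpaddd on ymm registers */
    }
}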

The reason numpy cannot effectively use the L1 cache is that it has huge overheads in the most basic operations: the code we are talking about here is executed via the very generic universal function infrastructure. That infrastructure (and python itself) adds huge amounts of overhead that, besides polluting the cache between operations, overshadows the computational cost of the L1-resident work. ufunc overhead is around 60-65% of the computing cost for arrays which fit into the L1 cache, so micro-optimizations squeezing out a cycle here and there will have almost no effect. To use AVX effectively for these operations we would need some form of operation merging/lazy evaluation, as numexpr does for example.

As I consistently see small slowdowns using avx compared to sse in these functions, and as the code is only enabled at compile time and will therefore see very little testing, I still favor reverting it (and not adding my variant), though I'll leave that decision to the more active developers. The probable overlap issue can simply be fixed by replacing the 16 byte distance checks with 32/64 byte checks at compile time, roughly as sketched below.
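A hedged sketch of that idea (not the real NumPy macro): the guard only has to match the widest store the loop performs, so the 16 byte distance used for SSE2 becomes 32 or 64 bytes when AVX or AVX-512 stores are compiled in.

#include <stdlib.h>

#if defined(__AVX512F__)
#define VECTOR_BYTES 64
#elif defined(__AVX__)
#define VECTOR_BYTES 32
#else
#define VECTOR_BYTES 16
#endif

/* Vectorization is safe when dst and src are either identical (pure in-place)
 * or at least one vector register apart; anything in between risks reading
 * data that an earlier iteration already overwrote. */
static int vector_store_safe(const char *dst, const char *src)
{
    long diff = labs((long)(dst - src));
    return diff == 0 || diff >= VECTOR_BYTES;
}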

An area which comes to mind that would profit is convolutions. The 1d convolutions are, next to the offloaded matrix multiplications, the densest computation type in numpy and thus a natural fit for vectorization.
I experimented with this a few years back on my vect-convolve branch:
juliantaylor@975474e
That branch uses sse2, but I tested avx and did see pretty large gains. Unfortunately I seem to have lost the avx variations of that code. Also, it was never in a mergeable state and needs considerable work to finish.
Even more beneficial would be working on something similar for 2d/nd convolutions in scipy.ndimage, but that code is unfortunately pretty difficult to deal with.

For something simpler, the boolean operations in the same simd.inc.src file would probably benefit from an avx2 implementation.
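A rough sketch (not NumPy code, the function name is made up) of why those loops are a good avx2 target: numpy booleans are one byte each, so a single 256-bit register tests 32 of them at once.

#include <immintrin.h>
#include <stddef.h>

/* Returns 1 if any of the n boolean bytes is nonzero (an np.any-style reduction). */
int any_bool_avx2(const unsigned char *p, size_t n)
{
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(p + i));
        /* bytes equal to zero become 0xFF; movemask collects one bit per byte */
        __m256i z = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());
        if ((unsigned)_mm256_movemask_epi8(z) != 0xFFFFFFFFu) {
            return 1;   /* at least one byte in this block was not zero */
        }
    }
    for (; i < n; i++) {
        if (p[i]) {
            return 1;
        }
    }
    return 0;
}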

@VictorRodriguez (Contributor)

I am using GCC 8, but GCC 7 might well give the same result on the example code I posted before; what code are you compiling, and which flags were used? I really appreciate you sharing the branch and code for vect-convolve; I will take the action item to merge it first as SSE and then as AVX. As soon as I have a functional and well-measured pull request I will add you as a reviewer. Thanks a lot for the guidance and help.

@charris (Member) commented Aug 7, 2018

This is more change than I would like in a minor release, so pushing the milestone back to 1.16.

@juliantaylor (Contributor Author)

The point is more about reverting the change if no one can show it is actually faster, or about fixing the overlap bug.

@VictorRodriguez (Contributor)

@juliantaylor the overlap bug is on my TODO list (just delayed by work stuff). 1.15 has just been released; can we wait a bit to see how the community reacts and whether it proves faster for them?

@mattip (Member) commented Oct 12, 2018

Will this see work before 1.16 feature freeze?

@charris (Member) commented Nov 14, 2018

Going to push this off. I suggest that the release note be omitted until we settle on whether we can do this or not. @VictorRodriguez What is the status of the overlap bug?

@charris modified the milestones: 1.16.0 release, 1.17.0 release (Nov 14, 2018)
@VictorRodriguez (Contributor)

@charris I have a WIP patch; should I send it out? It is not working yet.

@charris (Member) commented Nov 15, 2018

@VictorRodriguez Might put it up for discussion, with a link to this. It certainly won't make it into 1.16.

@juliantaylor (Contributor Author)

Here is a fix for the overlap issue: #12398

@tylerjereddy (Contributor)

Ping -- this has the 1.17 milestone -- is it realistically going to be ready for that?

@VictorRodriguez (Contributor)

@tylerjereddy could you please remind me what needs to be done before the 1.17 milestone? I had the impression this was fixed with the overlap fix commit. Thanks

@tylerjereddy (Contributor)

@VictorRodriguez Are you saying this PR isn't needed anymore? I'm just probing to see what is going on, given ~6 months since the last comment. If we don't need the PR, that's even easier.

@VictorRodriguez (Contributor)

@tylerjereddy this PR removes my patch for AVX support in numpy; @juliantaylor's point is that the compiler can do that job, so it is not necessary to do it with intrinsics as my patch does. The only arguable thing would be to provide numbers w/ and w/o the patch.

@charris (Member) commented Jun 14, 2019

Going to push this off to 1.18, the "cleanup" release. I like the idea of leaving the optimizations to the compilers, and the compilers are getting smarter. @juliantaylor Might take a look at what other optimizations we might want to leave to the compilers; the simpler we can keep our code, the better.

@charris modified the milestones: 1.17.0 release, 1.18.0 release (Jun 14, 2019)
@Guobing-Chen

I checked the binary code generated after reverting the AVX512/AVX2 intrinsics; it differs from the code generated with the intrinsics. The experiment was done with Numpy 1.17.2 and GCC 9.2.1 (gcc version 9.2.1 20190913, gcc-9-branch@275694).

Let's take the sse2_binary_scalar1_divide_DOUBLE function as an example. Without the intrinsics (i.e. reverting Victor's intrinsic implementation), the GCC-generated binary code (with march=skylake-avx512) still uses xmm registers for the main-block divide, while with Victor's patch the binary code uses zmm-based AVX512 division for the main block.
Reverting Victor's patch:

00000000001e0bc0 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0>: 
  1e0bc0:       c5 fb 10 0a             vmovsd xmm1,QWORD PTR [rdx]
  1e0bc4:       49 89 f8                mov    r8,rdi
  1e0bc7:       c5 fb 12 d1             vmovddup xmm2,xmm1
  1e0bcb:       41 83 e0 0f             and    r8d,0xf
  1e0bcf:       0f 85 1b 02 00 00       jne    1e0df0 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x230>
  1e0bd5:       49 89 c8                mov    r8,rcx
  1e0bd8:       49 89 f1                mov    r9,rsi
  1e0bdb:       31 c0                   xor    eax,eax
  1e0bdd:       49 83 e0 fe             and    r8,0xfffffffffffffffe
  1e0be1:       41 80 e1 0f             and    r9b,0xf
  1e0be5:       0f 84 42 02 00 00       je     1e0e2d <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x26d>
  1e0beb:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
  1e0bf0:       4c 39 c0                cmp    rax,r8
  1e0bf3:       7d 1a                   jge    1e0c0f <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x4f>
  1e0bf5:       0f 1f 00                nop    DWORD PTR [rax]
  1e0bf8:       c5 f9 10 2c c6          vmovupd xmm5,XMMWORD PTR [rsi+rax*8]
  1e0bfd:       c5 d1 5e c2             vdivpd xmm0,xmm5,xmm2
  1e0c01:       c5 f8 29 04 c7          vmovaps XMMWORD PTR [rdi+rax*8],xmm0
  1e0c06:       48 83 c0 02             add    rax,0x2
  1e0c0a:       49 39 c0                cmp    r8,rax
  1e0c0d:       7f e9                   jg     1e0bf8 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x38>
   ...

With Victor's patch (which is also the default of numpy):

00000000001e0c40 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0>:
  1e0c40:       c5 fb 10 12             vmovsd xmm2,QWORD PTR [rdx]
  1e0c44:       49 89 f8                mov    r8,rdi
  1e0c47:       62 f2 fd 48 19 ca       vbroadcastsd zmm1,xmm2
  1e0c4d:       41 83 e0 3f             and    r8d,0x3f
  1e0c51:       0f 85 09 02 00 00       jne    1e0e60 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x220>
  1e0c57:       49 89 c8                mov    r8,rcx
  1e0c5a:       49 89 f1                mov    r9,rsi
  1e0c5d:       31 c0                   xor    eax,eax
  1e0c5f:       49 83 e0 f8             and    r8,0xfffffffffffffff8
  1e0c63:       41 83 e1 3f             and    r9d,0x3f
  1e0c67:       0f 84 53 02 00 00       je     1e0ec0 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x280>
  1e0c6d:       0f 1f 00                nop    DWORD PTR [rax]
  1e0c70:       4c 39 c0                cmp    rax,r8
  1e0c73:       7d 20                   jge    1e0c95 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x55>
  1e0c75:       0f 1f 00                nop    DWORD PTR [rax]
  1e0c78:       62 f1 fd 48 10 2c c6    vmovupd zmm5,ZMMWORD PTR [rsi+rax*8]
  1e0c7f:       62 f1 d5 48 5e c1       vdivpd zmm0,zmm5,zmm1
  1e0c85:       62 f1 fd 48 29 04 c7    vmovapd ZMMWORD PTR [rdi+rax*8],zmm0
  1e0c8c:       48 83 c0 08             add    rax,0x8
  1e0c90:       49 39 c0                cmp    r8,rax
  1e0c93:       7f e3                   jg     1e0c78 <sse2_binary_scalar2_divide_DOUBLE.lto_priv.0+0x38>
   ...

With the latency data from Intel's instruction manual, the main-loop latency of the AVX512 intrinsic version is 8+23+5 = 36 cycles, which covers 8 doubles:

  1e0c78:       62 f1 fd 48 10 2c c6    vmovupd zmm5,ZMMWORD PTR [rsi+rax*8]
  1e0c7f:       62 f1 d5 48 5e c1       vdivpd zmm0,zmm5,zmm1
  1e0c85:       62 f1 fd 48 29 04 c7    vmovapd ZMMWORD PTR [rdi+rax*8],zmm0

While the reverted version goes through xmm, with a main-loop latency of 6+14+5 = 25 cycles for 2 doubles:

  1e0bf8:       c5 f9 10 2c c6          vmovupd xmm5,XMMWORD PTR [rsi+rax*8]
  1e0bfd:       c5 d1 5e c2             vdivpd xmm0,xmm5,xmm2
  1e0c01:       c5 f8 29 04 c7          vmovaps XMMWORD PTR [rdi+rax*8],xmm0

And the AVX2 version with the intrinsics is 7+14+5 = 26 cycles for 4 doubles (I don't paste the binary code here). So the AVX512 intrinsic version costs about 4.5 cycles per double, the AVX2 version about 6.5 cycles per double, and the SSE version about 12.5 cycles per double. Based on this, I think Victor's intrinsic patch is still worth keeping in Numpy, as it beats GCC's default code generation for AVX512/AVX2.

I attached the full binary code of this function for (AVX512_reverted, AVX512_intrinsic, AVX2_reverted, AVX2_intrinsic), dumped with objdump from _multiarray_umath.cpython-37m-x86_64-linux-gnu.so. The full objdump file is too big to upload; ping me if you want it and I can send it by email.
simd_intrinsic_revert_compare.txt

@charris (Member) commented Nov 25, 2019

I'm going to push this off again. It needs a rebase, and after the optimization infrastructure patches go in might be a better time to revisit.

@Guobing-Chen

I'm going to push this off again. It needs a rebase, and after the optimization infrastructure patches go in might be a better time to revisit.

@charris I guess by 'the optimization infrastructure patches' you mean PR #13516. And yes, I agree that with PR #13516 in, the intrinsic code related to this topic needs to be revisited to use the generic intrinsic wrappers defined by that PR.

@seberg (Member) commented May 6, 2020

Thanks! Too bad we never put this in. But by now we do have AVX on, I think, all of these, added by @r-devulap. I am not 100% sure that covers everything, but I think it covers enough to close the PR.

@seberg closed this May 6, 2020