ENH: Use AVX for float32 implementation of np.exp & np.log #13134
Conversation
NPY_INLINE NPY_GCC_OPT_3 NPY_GCC_TARGET_AVX2 __m256
avx2_fmadd(__m256 a, __m256 b, __m256 c)
{
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
AVX2 has fmadd natively; does it not have the required semantics for this, while the AVX512 fmadd does?
It does, but it additionally requires the "fma" target attribute. FMA was introduced after AVX2, so it wouldn't work on older CPUs which have AVX2 but no FMA.
Hm, right.
Though I wonder if this has an effect on the accuracy bounds you quoted for AVX2.
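For intuition, the difference between a fused and an unfused multiply-add can be shown in a few lines. This sketch emulates the fused result by doing the product and sum in float64 and rounding once to float32; that is an assumption (a float32 product is exact in float64, but the final double rounding may not match a hardware FMA in every case):

```python
import numpy as np

# Inputs chosen so the float32 product a*a loses its low-order bit:
# a*a = 1 + 2^-11 + 2^-24, which rounds (to even) down to 1 + 2^-11.
a = np.float32(1 + 2.0 ** -12)
c = np.float32(-(1 + 2.0 ** -11))

# Unfused, as in _mm256_add_ps(_mm256_mul_ps(a, a), c): two roundings.
unfused = a * a + c

# Emulated fused multiply-add: one rounding at the end (see caveat above).
fused = np.float32(np.float64(a) * np.float64(a) + np.float64(c))

print(unfused, fused)  # the two results differ by exactly 2**-24
```

So the emulated fmadd and a true FMA can legitimately disagree in the last bit of intermediate products, which is why the question about the quoted error bounds is reasonable.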
The max ULP error and relative error reported above actually hold for both the AVX2 and AVX512 implementations. They were computed by comparing the output of these functions to glibc's scalar implementation of exp and log in double precision (higher precision) across all 32-bit floats (2^32 values). I did that for both the AVX2 and AVX512 versions, and it turns out the max error is the same for both of them.
But the two variants probably do not return exactly the same results?
For reproducibility it would be preferable to minimize differences in results across hardware generations.
Only supporting this for AVX2+FMA hardware is probably not unreasonable.
Though do you know how common AVX2 hardware without FMA is?
So, I ran an exhaustive check for all float32 values and verified that the AVX2 and AVX512 versions of log and exp produce exactly the same result (0 ULP error, 0.0 absolute error).
Intel introduced FMA in Haswell, so Ivy Bridge and Sandy Bridge (released in 2011-2012) don't have FMA. Since the two versions produce the same result, I would like to keep it as is, if that is okay with you.
This looks very nice and uses our runtime detection, so it is an automatic performance improvement for users with the right hardware using gcc. It is always a bit dodgy to implement the basic functions ourselves instead of using libm, as we lose out on potential improvements there (e.g. glibc has AVX512 code for these functions and more). Would be interesting to test this against a few other projects like SciPy and see if tests break.
Thank you for reviewing this patch! I did run the SciPy test suite with this version of NumPy on my SkylakeX machine and didn't see any tests fail.
Do you have a reference for the algorithm used here?
I would like to compare this to glibc's exp implementation (there is an old branch of mine which allows using it from numpy, #7865), but mostly just out of curiosity. A concern is that we do not have good coverage of whether these basic math functions actually compute correct results over the full range; while I trust it is correct currently, to keep it correct, coverage would be good.
}

NPY_INLINE NPY_GCC_OPT_3 NPY_GCC_TARGET_AVX2 __m256
avx2_get_partial_load_mask(const npy_int num_elem, const npy_int total_elem)
the functions all need to be static as we do not need external linkage for them
The algorithm is based on methods presented in the book "Elementary Functions: Algorithms and Implementation" by Jean-Michel Muller. Basically, the algorithm for a transcendental function is broken into 3 steps:
1. Range reduction: map the argument into a small interval (for exp, write x = k*ln(2) + r; for log, split the input into its exponent and mantissa).
2. Polynomial approximation of the function over the reduced interval.
3. Reconstruction of the final result from the polynomial value and the reduction terms.
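For illustration only, here is a toy scalar version of that three-step scheme for exp. It uses a plain Taylor polynomial rather than whatever polynomial the patch actually uses, so it sketches the structure, not the implementation:

```python
import math

def exp_sketch(x):
    # Step 1: range reduction. Write x = k*ln(2) + r with |r| <= ln(2)/2.
    k = round(x / math.log(2))
    r = x - k * math.log(2)
    # Step 2: polynomial approximation of e^r on the small interval.
    # (A degree-6 Taylor series here; real implementations use minimax fits.)
    p = sum(r ** n / math.factorial(n) for n in range(7))
    # Step 3: reconstruction. e^x = 2^k * e^r.
    return math.ldexp(p, k)
```

On |r| <= ln(2)/2 the degree-6 truncation error is on the order of 1e-7 relative, far looser than the ULP bounds quoted in this PR, which is why production code uses carefully fitted polynomials.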
Also, I learned the trick for improving the accuracy of the log function (keeping the range-reduced value, which is basically the mantissa, greater than 1/sqrt(2)) from here: https://stackoverflow.com/questions/45770089/efficient-implementation-of-log2-m256d-in-avx2
I did compare the accuracy of this implementation with glibc's libmvec library (32-bit version), and it turns out libmvec has slightly worse accuracy than this implementation. As mentioned above, both implementations were compared against glibc's scalar double-precision implementation.
I have another patch ready which implements float32 versions of sine and cosine using AVX2 and AVX512 with a max ULP error of 1.49. If you like, I can add that to this pull request as well.
Very nice; can you provide the code you used to compare the implementations? Please do not add sin/cos to this PR, let's do those one at a time.
The code is fairly simple, actually: have a for loop iterating over all 32-bit floats and, for each float, evaluate both the AVX implementation and glibc's scalar double-precision implementation and record the difference.
You can then compute the max ULP error, and do the same to get the max relative error. You can repeat the same loop to get the max ULP error for the libmvec implementation by simply calling _ZGVeN16v_expf instead of AVX2_exp_FLOAT.
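In NumPy terms, that comparison loop can be sketched as follows. To keep it quick, this samples the 2^32 bit patterns with a stride instead of walking all of them, uses np.exp as a stand-in for the AVX function under test, and restricts inputs so the results stay normal floats; the helper name ulp_key is made up here:

```python
import numpy as np

# A strided sample of float32 bit patterns (all 2**32 in the real run).
bits = (np.arange(2 ** 20, dtype=np.uint64) * 2 ** 12).astype(np.uint32)
x = bits.view(np.float32)
x = x[np.isfinite(x) & (np.abs(x) < 80)]  # exp(x) stays a normal float32

ref = np.exp(x.astype(np.float64)).astype(np.float32)  # high-precision reference
got = np.exp(x)                                        # implementation under test

def ulp_key(f):
    # Map float32 bit patterns to integers monotonic in the float ordering,
    # so an integer difference counts ULPs between two values.
    i = f.view(np.int32).astype(np.int64)
    return np.where(i >= 0, i, -2 ** 31 - i)

max_ulp = int(np.abs(ulp_key(got) - ulp_key(ref)).max())
rel_err = float(np.abs(got.astype(np.float64) / np.exp(x.astype(np.float64)) - 1).max())
```

Substituting the vector function under test for `got` gives exactly the max-ULP and max-relative-error numbers quoted in this thread.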
There are warnings when compiling with various options; are some of the #ifdef clauses not quite correct?
Could you please tell me which compiler and flags you are using? With gcc 7.3.0 it seems to build fine. There are some warnings, but nothing seems to be from this patch. If you are using clang 6.0.0, then this patch fails to build because of this error (gcc, or even clang 7.0, builds fine): _configtest.c:7:20: error: instruction requires: AVX-512 ISA
The CI is failing. You should see a red X next to the changeset label above. Clicking on that X should open a set of links to the failing CI runs: Travis CI, and Azure macOS and 32-bit Linux. The CI fails on compiler warnings.
Linux 32-bit is failing -
Yeah, I still don't fully understand why that is failing. Do you have any suggestions?
Is it worth trying
Even if that did work, I guess you'd still need some kind of guard so that you can still import NumPy on a "regular" compile without the -m flag on 32-bit. Other random thought: that Docker container will likely have pretty minimal libs installed, so if you're expecting something specific it could be missing.
Hmm, I don't have any dependencies on other libs. If GCC supports AVX512F intrinsics, then it compiles them. It seems to be using gcc 7, which does support them, so I assume those functions should compile fine and should exist. At the time of loading the .so file (when importing numpy), if cpuid/xgetbv returns a successful check for avx512f, then it points to the avx512f version of the function, FLOAT_exp_avx512f, which is where the failure seems to occur.
Fired up my bionic 32-bit chroot. Forcing the architecture with
We have had reports of "illegal instruction" in the past. Probably the best course is to check for one of the
Thanks for taking a look! I am still not entirely sure what's going on, but I have to run now. I will get back to this tomorrow. Thanks for your help :)
SSE and AVX probably aren't available on 32-bit processors by default. GCC for the i686 arch assumes the Pentium Pro instruction set (1995); later additions need flags. I don't think we support non-SSE2 hardware any more, at least on 64-bit processors. There was a discussion at some point. Anyway, https://gcc.gnu.org/onlinedocs/gcc-4.8.0/gcc/i386-and-x86_002d64-Options.html shows the options.
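On the detection side, the feature flags the kernel believes the CPU has can be read from /proc/cpuinfo. A small, Linux-only sketch (flag names such as sse2, avx2, fma, and avx512f are the kernel's spellings):

```python
def cpu_flags(path="/proc/cpuinfo"):
    """Return CPU feature flags reported by the Linux kernel, or an
    empty set on systems without a readable /proc/cpuinfo."""
    try:
        with open(path) as f:
            for line in f:
                # x86 kernels label the list "flags"; ARM uses "Features".
                if line.startswith(("flags", "Features")):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

flags = cpu_flags()
print("sse2:", "sse2" in flags, "| avx2:", "avx2" in flags,
      "| fma:", "fma" in flags, "| avx512f:", "avx512f" in flags)
```

Inside a VM or chroot this reports what the hypervisor exposes, which is exactly the mismatch being debugged in this thread.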
Does the CI system test the build of every commit in the PR separately, or does it just build once, including all the commits in the PR?
It builds only the most recent commit of all the commits you push at once.
Correct, we don't support non SSE2 systems anymore. 32-bit Windows binaries are compiled with SSE2, from
Copying this comment on the PR that introduced AVX code:
That last bit was about installers built with ATLAS, it no longer applies to our current wheels. I'm not sure about how this works on Windows now, but given that the AVX
This is a nice discussion on portability of AVX code on Windows: https://randomascii.wordpress.com/2016/12/05/vc-archavx-option-unsafe-at-any-speed/. It looks like we use the
It would be very useful to document better how this actually works at the moment. We don't use a
So today we require SSE2, and not AVX, on Windows. On Linux we do ship with AVX enabled, because there we compile with GCC and it produces portable binaries. Searching for "illegal instruction" plus Windows gives only a couple of issues: one that was fixed, and others that were OpenBLAS or VM related. No regular reports of AVX instructions sneaking into our Windows wheels. Hope this helps; additions and corrections very welcome!
Re 32-bit Linux: there are a number of issues related to VM use where the detected CPU instructions are wrong. On CI we could perhaps work around that by explicitly disabling AVX512? (I haven't thought about that suggestion very hard; whatever works, I'd say, since 32-bit performance optimizations aren't very interesting.)
The remaining failures are only on linux32 (which does not have any sse/avx support) and when setting
I was resolving a merge conflict on the release doc, and it seems two tests failed with "The install of mingw was NOT successful." I think this has nothing to do with my patch.
Looks like a fluke connection failure; I've restarted the job for you.
Thank you, they all pass now. Any word on merging this patch yet?
Do we have any tests already like the ULP test you describe? It would be nice to add the ULP test code as a C function. At some point we could add "correct value" testing for numpy math functions, much as Python tests cmath using a file with expected results generated by MPFR.
Could you please elaborate on the purpose of such a test? Different algorithms for transcendental functions will not necessarily produce exactly the same output, so we can't check for a single "correct value". Perhaps we can check that they are within a certain acceptable threshold, but that raises two questions: (1) what threshold to use? and (2) which float32 values should we validate in the test?
The threshold is TBD; see for instance the test I linked to. It is a lot of work to create the "expected" results. I started an exchange a few years ago on how to generate the file for 64-bit complex numbers and 32-bit floats; maybe today it would be easier.
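For the threshold style of test, numpy.testing already ships a building block: assert_array_max_ulp raises if any element of two arrays differs by more than maxulp units in the last place, and returns the per-element ULP differences. A sketch against a double-precision reference (the maxulp=4 threshold here is illustrative, not a vetted bound):

```python
import numpy as np

x = np.linspace(-10, 10, 10001, dtype=np.float32)
# Reference: evaluate in double precision, then round to float32.
expected = np.exp(x.astype(np.float64)).astype(np.float32)

# Raises AssertionError if any element is more than 4 ULPs away.
diff = np.testing.assert_array_max_ulp(np.exp(x), expected, maxulp=4)
```

This sidesteps question (2) somewhat: the reference is computed on the fly rather than stored in a file, at the cost of trusting the double-precision routine.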
Sounds good. I am almost done with another patch vectorizing sine and cosine. Would you mind if I submit the ULP test as part of that patch and we merge this as is?
This commit implements vectorized single precision exponential and natural log using AVX2 and AVX512.

Accuracy:

| Function | Max ULP Error | Max Relative Error |
|----------|---------------|--------------------|
| np.exp   | 2.52          | 2.1E-07            |
| np.log   | 3.83          | 2.4E-07            |

Performance:

(1) Micro-benchmarks: measured execution time of np.exp and np.log using the timeit package in Python. Each function is executed 1000 times and this is repeated 100 times. The standard deviation for all the runs was less than 2% of their mean value and hence is not included in the data. The vectorized implementation was up to 7.6x faster than the scalar version.

| Function | NumPy1.16 | AVX2   | AVX512 | AVX2 speedup | AVX512 speedup |
|----------|-----------|--------|--------|--------------|----------------|
| np.exp   | 0.395s    | 0.112s | 0.055s | 3.56x        | 7.25x          |
| np.log   | 0.456s    | 0.147s | 0.059s | 3.10x        | 7.64x          |

(2) Logistic regression: exp and log are heavily used in training neural networks (as part of the sigmoid activation function and the loss function, respectively). This patch significantly speeds up training a logistic regression model. As an example, we measured how much time it takes to train a model with 15 features using 1000 training data points. We observed a 2x speed-up to train the model to a loss function error < 10E-04.

| Function       | NumPy1.16 | AVX2   | AVX512 | AVX2 speedup | AVX512 speedup |
|----------------|-----------|--------|--------|--------------|----------------|
| logistic.train | 121.0s    | 75.02s | 60.60s | 1.61x        | 2.02x          |
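The micro-benchmark described above can be reproduced along these lines; the array length is a guess (the commit message does not state it), and the repeat count is trimmed from 100 to 5 to keep the run short:

```python
import timeit
import numpy as np

x = np.random.rand(10000).astype(np.float32)  # hypothetical input size

# Each trial times 1000 executions of np.exp, as in the table above.
trials = timeit.repeat(lambda: np.exp(x), number=1000, repeat=5)
best = min(trials)
print(f"np.exp, 1000 calls over {x.size} floats: best of 5 = {best:.4f}s")
```

Taking the minimum over repeats is the usual timeit convention, since the fastest run has the least interference from the rest of the system.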
1) Got rid of the @isa@_cmp_mask helper function, since clang expects the compare operator in _mm@vsize@_cmp_ps to be a compile-time constant. 2) Converted all helper functions to static functions.
clang 6.0 fails to compile this code: __asm__ __volatile__ ( "vpaddd %zmm1, %zmm2, %zmm3\n\t" ); Note that this is a known issue in clang 6.0; clang 7.0 and gcc do not have this problem. The failure means the flag HAVE_LINK_AVX512F is not set, and hence the AVX512F version of exp and log doesn't get built. If AVX512F is detected at runtime, instead of choosing to run the AVX2 version, it will end up running the scalar version.
1) Use __builtin_cpu_supports("avx512f") only for gcc version >= 5. 2) Introduced two new macros: HAVE_ATTRIBUTE_TARGET_@ISA@_WITH_INTRINSICS, for ensuring the compiler can compile functions that use intrinsics and are compiled with avx2/avx512f attributes
Added a missing NPY_HAVE_SSE2_INTRINSICS guard
Hi, just checking in on the status of the patch. Let me know if there is anything else you need from me.
Thanks @r-devulap
I am not really worried, but I stumbled on this while looking up recent speed improvements in NumPy. The speedups here probably mean a slightly larger error on these functions (compared to non-vectorized/fastmath versions)? At what point do we worry about such precision changes?
xref #13515
@seberg for comparison against libm (scalar) and libmvec (glibc's vector implementation):