ENH: Use AVX for float32 implementation of np.exp & np.log #13134


Merged
merged 6 commits into numpy:master on Apr 20, 2019

Conversation

r-devulap
Member

This commit implements vectorized single precision exponential and
natural log using AVX2 and AVX512.

Accuracy:

| Function | Max ULP Error | Max Relative Error |
|----------|---------------|--------------------|
| np.exp   | 2.52          | 2.1E-07            |
| np.log   | 3.83          | 2.4E-07            |

Performance:

(1) Micro-benchmarks: we measured the execution time of np.exp and np.log using the timeit package in Python. Each function is executed 1000 times, and this is repeated 100 times. The standard deviation across runs was less than 2% of the mean and is therefore not included in the data. The vectorized implementation was up to 7.6x faster than the scalar version.

| Function | NumPy1.16 | AVX2   | AVX512 | AVX2 speedup | AVX512 speedup |
|----------|-----------|--------|--------|--------------|----------------|
| np.exp   | 0.395s    | 0.112s | 0.055s | 3.56x        | 7.25x          |
| np.log   | 0.456s    | 0.147s | 0.059s | 3.10x        | 7.64x          |

(2) Logistic regression: exp and log are heavily used in training neural networks (as part of the sigmoid activation function and the loss function, respectively). This patch significantly speeds up training a logistic regression model. As an example, we measured how long it takes to train a model with 15 features using 1000 training data points. We observed a 2x speedup in training the model to a loss function error < 10E-04.

| Function       | NumPy1.16 | AVX2   | AVX512 | AVX2 speedup | AVX512 speedup |
|----------------|-----------|--------|--------|--------------|----------------|
| logistic.train | 121.0s    | 75.02s | 60.60s | 1.61x        | 2.02x          |

@charris charris changed the title ENH: vectorizing float32 implementation of np.exp & np.log ENH: Use AVX for float32 implementation of np.exp & np.log Mar 16, 2019
NPY_INLINE NPY_GCC_OPT_3 NPY_GCC_TARGET_AVX2 __m256
avx2_fmadd(__m256 a, __m256 b, __m256 c)
{
return _mm256_add_ps(_mm256_mul_ps(a, b), c);
Contributor

avx2 has fmadd natively, does it not have the required semantics for this and avx512 fmadd does?

Member Author

It does, but it additionally requires the "fma" target attribute. FMA was introduced after AVX2, so it wouldn't work with older CPUs which have AVX2 but no FMA.
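
For illustration, a sketch of the two variants (function names are mine; the PR uses the mul+add form so it runs on AVX2 hardware without FMA):

```c
#include <immintrin.h>

/* two roundings: the multiply is rounded, then the add is rounded; only
 * needs the "avx2" target attribute */
__attribute__((target("avx2")))
static __m256
fmadd_mul_add(__m256 a, __m256 b, __m256 c)
{
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

/* one rounding: a*b+c is computed exactly and rounded once, but this needs
 * the "fma" target attribute and FMA hardware at runtime */
__attribute__((target("avx2,fma")))
static __m256
fmadd_fused(__m256 a, __m256 b, __m256 c)
{
    return _mm256_fmadd_ps(a, b, c);
}
```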

Contributor

hm right
Though I wonder if this has an effect on the accuracy bounds you quoted for avx2

Member Author

The max ULP error and relative error reported above actually hold for both the AVX2 and AVX512 implementations. They were computed by comparing the output of these functions against glibc's scalar implementations of exp and log in double precision (higher precision) across all 32-bit floats (2^32 values). I did that for both the AVX2 and AVX512 versions, and it turns out the max error is the same for both of them.

Contributor

But the two variants probably do not return the exact same results?
For reproducibility it would be preferable to minimize the amount of different results on different hardware generations.
Only supporting this for AVX2+fma hardware is probably not unreasonable.

Though do you know how common avx2 hardware without fma is?

Member Author

So, I did run an exhaustive check over all float32 values and verified that the avx2 and avx512 versions of log and exp produce the exact same result (0 ULP error, 0.0 absolute error).

Intel introduced FMA in Haswell. So, Ivy Bridge and Sandy Bridge (released in 2011-2012) don't have FMA. Since the two versions produce the same result, I would like to keep it as is, if that is okay with you.

@juliantaylor
Contributor

This looks very nice and uses our runtime detection, so users with the right hardware get an automatic performance improvement when building with gcc.

It is always a bit dodgy to implement the basic functions ourselves rather than using libm, as we lose out on potential improvements there (e.g. glibc has avx512 code for these functions and more). Then again, we gain more control over when they change, and the results are more consistent across platforms.

Would be interesting to test this against a few other projects like scipy and see if tests break.

@r-devulap
Member Author

> This looks very nice and uses our runtime detection, so users with the right hardware get an automatic performance improvement when building with gcc.
>
> It is always a bit dodgy to implement the basic functions ourselves rather than using libm, as we lose out on potential improvements there (e.g. glibc has avx512 code for these functions and more). Then again, we gain more control over when they change, and the results are more consistent across platforms.
>
> Would be interesting to test this against a few other projects like scipy and see if tests break.

Thank you for reviewing this patch! I did run the test suite in SciPy with this version of NumPy on my SkylakeX machine and didn't see any tests fail.

@juliantaylor
Contributor

Do you have a reference for the algorithm used here?

@juliantaylor
Contributor

I would like to compare this with glibc's exp implementation (there is an old branch of mine, #7865, which allows using it from numpy), but mostly just out of curiosity.

A concern is that we do not have good test coverage of whether these basic math functions actually compute correct results over the full range; while I trust it is correct currently, coverage would be needed to keep it correct. Then again, the same thing applies to our fallback implementations in case the system libm does not provide decent functions... so it is not any worse than the status quo if we merge as is.

}

NPY_INLINE NPY_GCC_OPT_3 NPY_GCC_TARGET_AVX2 __m256
avx2_get_partial_load_mask(const npy_int num_elem, const npy_int total_elem)
Contributor

the functions all need to be static as we do not need external linkage for them

@r-devulap
Member Author

r-devulap commented Mar 21, 2019

> Do you have a reference for the algorithm used here?

The algorithm is based on methods presented in the book "Elementary Functions: Algorithms and Implementation" by Jean-Michel Muller. Basically, an algorithm for a transcendental function is broken into 3 steps:

  1. Range reduction (chapter 9 covers this for sin, cos, exp and log). I ended up using the Cody-Waite range reduction method, which is very SIMD friendly (see the sketch below).
  2. Polynomial approximation in the reduced range (using a minimax polynomial approximation), whose coefficients can be computed with the Remez algorithm. The Boost library provides a nice solver for this; I used it to experiment with different polynomials and picked one that provided reasonable accuracy.
  3. Reconstruction of the original result, which is a simple multiplication or addition in this case.

Also, I learned the trick for improving the accuracy of the log function (keeping the range-reduced value, which is basically the mantissa, greater than 1/sqrt(2)) from here: https://stackoverflow.com/questions/45770089/efficient-implementation-of-log2-m256d-in-avx2
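
To make step 1 concrete, here is a scalar sketch of Cody-Waite range reduction for exp; the constants and structure are the standard textbook ones, not lifted from this PR's SIMD code:

```c
#include <math.h>

/* Cody-Waite: split ln(2) into a high part that is exactly representable in
 * float32 plus a small correction, so that x - n*ln2_hi is computed without
 * rounding error and the reduced argument r keeps nearly full precision. */
static float
exp_range_reduce(float x, int *n)
{
    const float log2e  = 1.44269504f;     /* 1/ln(2) */
    const float ln2_hi = 6.93145752e-1f;  /* high bits of ln(2), exact in float32 */
    const float ln2_lo = 1.42860677e-6f;  /* ln(2) - ln2_hi */
    float nf = rintf(x * log2e);          /* x = n*ln(2) + r, |r| <= ln(2)/2 */
    *n = (int)nf;
    return (x - nf * ln2_hi) - nf * ln2_lo;
}

/* steps 2 and 3: expf(x) ~= ldexpf(P(r), n), where P is a minimax
 * polynomial approximating e^r on the reduced range */
```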

@r-devulap
Member Author

r-devulap commented Mar 21, 2019

> I would like to compare this with glibc's exp implementation (there is an old branch of mine, #7865, which allows using it from numpy), but mostly just out of curiosity.
>
> A concern is that we do not have good test coverage of whether these basic math functions actually compute correct results over the full range; while I trust it is correct currently, coverage would be needed to keep it correct. Then again, the same thing applies to our fallback implementations in case the system libm does not provide decent functions... so it is not any worse than the status quo if we merge as is.

I did compare the accuracy of this implementation with glibc's libmvec library (32-bit version), and it turns out libmvec has slightly worse accuracy than this implementation. As mentioned above, both implementations were compared against glibc's scalar double-precision implementation.

| Function | libmvec max ULP | this patch max ULP |
|----------|-----------------|--------------------|
| sine     | 2.47            | 1.49               |
| cosine   | 2.39            | 1.49               |
| exp      | 2.93            | 2.52               |
| log      | 3.96            | 3.83               |

I have another patch ready which implements float32 sine and cosine using AVX2 and AVX512 with a max ULP error of 1.49. If you like, I can add that to this pull request as well.

@juliantaylor
Contributor

very nice, can you provide the code you used to compare the implementations?

please do not add sin/cos to this PR, let's do those one at a time.

@r-devulap
Member Author

r-devulap commented Mar 21, 2019

> very nice, can you provide the code you used to compare the implementations?
>
> please do not add sin/cos to this PR, let's do those one at a time.

The code is fairly simple actually:

Have a for loop iterating over all 32-bit floats, and for each float x:

  1. double my_value = (double) AVX2_exp_FLOAT(x); // upscale the float32 result to a double
  2. double reference_value = exp((double) x); // glibc's scalar double-precision function
  3. double ULP_error = compute_ULP(my_value, reference_value); // a small helper function which computes the error in terms of ULP

You can then compute the max ULP error, and do the same to get the max relative error. You can repeat the same loop to get the max ULP error for the libmvec implementation by simply calling _ZGVeN16v_expf instead of AVX2_exp_FLOAT. A self-contained sketch follows below.
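
For concreteness, a minimal sketch of that loop, assuming a compute_ULP helper of my own devising (the thread does not show the real one) and with scalar expf standing in for AVX2_exp_FLOAT:

```c
#include <float.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* measure |computed - reference| in units of the float32 spacing (ULP)
 * at the reference value */
static double
compute_ULP(double computed, double reference)
{
    int e;
    frexpf((float)reference, &e);        /* binary exponent of the reference */
    double one_ulp = ldexp(1.0, e - 24); /* float32 has a 24-bit significand */
    return fabs(computed - reference) / one_ulp;
}

int
main(void)
{
    double max_ulp = 0.0;
    for (uint64_t i = 0; i <= UINT32_MAX; i++) {   /* all 2^32 bit patterns */
        uint32_t bits = (uint32_t)i;
        float x;
        memcpy(&x, &bits, sizeof(x));
        if (!isfinite(x))
            continue;
        double mine = (double)expf(x);   /* stand-in for AVX2_exp_FLOAT(x) */
        double ref  = exp((double)x);    /* glibc's scalar double-precision exp */
        if (!(fabs(ref) <= FLT_MAX) || ref == 0.0)
            continue;                    /* skip float32 overflow/underflow edges */
        double err = compute_ULP(mine, ref);
        if (err > max_ulp)
            max_ulp = err;
    }
    printf("max ULP error: %.2f\n", max_ulp);
    return 0;
}
```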

@mattip
Member

mattip commented Mar 27, 2019

There are warnings when compiling with various options, are some of the #ifdef clauses not quite correct?

@r-devulap
Member Author

> There are warnings when compiling with various options, are some of the #ifdef clauses not quite correct?

Could you please tell me which compiler and flags you are using? With gcc 7.3.0, it seems to build fine. There are some warnings, but nothing seems to be from this patch.

If you are using clang 6.0.0, then this patch fails to build because of this error (gcc, or even clang 7.0, builds fine):

_configtest.c:7:20: error: instruction requires: AVX-512 ISA
__asm__ volatile("vpaddd %zmm1, %zmm2, %zmm3");

@mattip
Member

mattip commented Mar 27, 2019

The CI is failing. You should see a red X next to the changeset label above:

BUG: Fixing compile time error for clang                               X f802e92 

Clicking on that X should open a set of links to failing CI runs: travis ci, and azure macos and 32 bit linux. The CI fails on compiler warnings.

@mattip
Member

mattip commented Mar 27, 2019

Linux 32 bit is failing - ImportError: /numpy/build/testenv/lib/python3.6/site-packages/numpy/core/_multiarray_umath.cpython-36m-i386-linux-gnu.so: undefined symbol: AVX512F_exp_FLOAT

@r-devulap
Member Author

> Linux 32 bit is failing - ImportError: /numpy/build/testenv/lib/python3.6/site-packages/numpy/core/_multiarray_umath.cpython-36m-i386-linux-gnu.so: undefined symbol: AVX512F_exp_FLOAT

yeah, I still don't fully understand why that is failing. Do you have any suggestions?

@tylerjereddy
Contributor

Is it worth trying -march=skylake-avx512 or a similar compile flag for the 32-bit Ubuntu docker container? Maybe 32-bit Linux requires a little more convincing to emit the appropriate code / use the appropriate libs on that platform?

@tylerjereddy
Contributor

Even if that did work, I guess you'd still need some kind of guard so that you can still import NumPy on a "regular" compile without the -m flag on 32-bit.

Other random thought--that Docker container will likely have pretty minimal libs installed so if you're expecting something specific it could be missing.

@r-devulap
Member Author

r-devulap commented Mar 27, 2019

hmm, I don't have any dependencies on other libs. If GCC supports avx512f intrinsics, then it compiles them. It seems to be using gcc7, which does support them, so I assume those functions should be compiled fine and should exist.

At the time of loading the .so file (when importing numpy), if cpuid/xgetbv returns a successful check for avx512f, then it points to the avx512f version of the function FLOAT_exp_avx512f, which is where the failure seems to occur.
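
For reference, a minimal sketch of that load-time dispatch pattern (this is not NumPy's actual code; __builtin_cpu_supports is the GCC builtin that also comes up later in this thread):

```c
#include <math.h>
#include <stddef.h>

typedef void (*exp_loop_t)(float *dst, const float *src, size_t n);

/* placeholder bodies; the real versions live in code compiled with the
 * appropriate target attributes */
static void
exp_scalar(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = expf(src[i]);
}

static void
exp_avx512f(float *dst, const float *src, size_t n)
{
    /* stand-in for the AVX512F loop; if this symbol never got compiled into
     * the .so, importing fails exactly as in the CI log above */
    exp_scalar(dst, src, n);
}

static exp_loop_t exp_impl;

/* run once at module load: resolve the function pointer from the CPU check */
static void
init_exp_dispatch(void)
{
    exp_impl = __builtin_cpu_supports("avx512f") ? exp_avx512f : exp_scalar;
}
```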

@mattip
Member

mattip commented Mar 27, 2019

fired up my bionic 32-bit chroot. Forcing the architecture with CFLAGS='-march=skylake-avx512' indeed allows the module to load, but then it crashes with Illegal instruction when running tests. Apparently the check is not quite good enough for a chroot.

@mattip
Member

mattip commented Mar 27, 2019

We have had reports of "illegal instruction" in the past. Probably the best course is to check for one of the __SSE__ defines on the build platform, and skip all this code if it is not defined (see the sketch below). i386 is a bit of a corner case; is there even any 32-bit hardware still available that has AVX512 support? How does the other sse/avx code get around this?
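
Concretely, the kind of guard being suggested might look like this sketch (NPY_HAVE_SSE2_INTRINSICS is the macro that appears in the snippets quoted below):

```c
/* build-time guard: on platforms without baseline SSE2 intrinsics (e.g.
 * plain i386 builds), none of the AVX exp/log code is compiled, so the
 * binary never references AVX512F symbols that do not exist */
#ifdef NPY_HAVE_SSE2_INTRINSICS
#include <immintrin.h>
/* AVX2 / AVX512F implementations and their dispatch entries go here */
#endif
```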

@r-devulap
Member Author

Thanks for taking a look! I am still not entirely sure what's going on, but I've got to run now. I will get back to this tomorrow. Thanks for your help :)

@charris
Member

charris commented Mar 27, 2019

SSE and AVX probably aren't available on 32-bit processors by default. Gcc for the i686 arch assumes the Pentium Pro instruction set (1995); later additions need flags. I don't think we support non-SSE2 hardware any more, at least on 64-bit processors. There was a discussion at some point. Anyway, https://gcc.gnu.org/onlinedocs/gcc-4.8.0/gcc/i386-and-x86_002d64-Options.html shows the options.

@r-devulap
Member Author

r-devulap commented Mar 28, 2019

Does the CI system test the build of every commit in the PR separately? Or does it just do one build that includes all the commits in the PR?

@rgommers
Member

> Does the CI system test the build of every commit in the PR separately? Or does it just do one build that includes all the commits in the PR?

it builds only the most recent commit if you push all the commits at once

@rgommers
Member

> SSE and AVX probably aren't available on 32-bit processors by default. Gcc for the i686 arch assumes the Pentium Pro instruction set (1995); later additions need flags. I don't think we support non-SSE2 hardware any more, at least on 64-bit processors

Correct, we don't support non SSE2 systems anymore. 32-bit Windows binaries are compiled with SSE2, from msvccompiler.py:

        # msvc9 building for 32 bits requires SSE2 to work around a
        # compiler bug.
        if platform_bits == 32:
            self.compile_options += ['/arch:SSE2']
            self.compile_options_debug += ['/arch:SSE2']

Copying this comment on the PR that introduced AVX code:

> linux always builds generic binaries, this doesn't change this as the unsupported instructions will never be run if the cpu does not support it.
>
> windows has some special case to build 3 variants with nosse, sse2 and sse3 and chooses at install time what to use.

That last bit was about installers built with ATLAS; it no longer applies to our current wheels. I'm not sure about how this works on Windows now, but given that the AVX

/*
 * This file is for the definitions of simd vectorized operations.
 *
 * Currently contains sse2 functions that are built on amd64, x32 or
 * non-generic builds (CFLAGS=-march=...)
 * In future it may contain other instruction sets like AVX or NEON detected
 * at runtime in which case it needs to be included indirectly via a file
 * compiled with special options (or use gcc target attributes) so the binary
 * stays portable.
 */


#ifndef __NPY_SIMD_INC
#define __NPY_SIMD_INC

#include "lowlevel_strided_loops.h"
#include "numpy/npy_common.h"
#include "numpy/npy_math.h"
#ifdef NPY_HAVE_SSE2_INTRINSICS              /* will be true for our wheels */
#include <emmintrin.h>                       /* SSE2 */
#if !defined(_MSC_VER) || _MSC_VER >= 1600   /* will be true for our wheels */
#include <immintrin.h>                       /* AVX */
#else
#undef __AVX2__
#undef __AVX512F__
#endif
#endif

This is a nice discussion on portability of AVX code on Windows: https://randomascii.wordpress.com/2016/12/05/vc-archavx-option-unsafe-at-any-speed/. It looks like we use the static inline trick there; however, we don't seem to generate different code for SSE2-only vs AVX-available builds (from simd.inc.src):

static NPY_INLINE int
run_binary_simd_@kind@_@TYPE@(char **args, npy_intp *dimensions, npy_intp *steps)
{
#if @vector@ && defined NPY_HAVE_SSE2_INTRINSICS
    @type@ * ip1 = (@type@ *)args[0];
    @type@ * ip2 = (@type@ *)args[1];
    @type@ * op = (@type@ *)args[2];
    npy_intp n = dimensions[0];
#if defined __AVX512F__
    const npy_intp vector_size_bytes = 64;
#elif defined __AVX2__
    const npy_intp vector_size_bytes = 32;
#else
    const npy_intp vector_size_bytes = 32;
#endif

It would be very useful to document better how this actually works at the moment. We don't use a /arch:AVX flag for building wheels, and the MSVC docs say:

> __AVX__ Defined as 1 when the /arch:AVX or /arch:AVX2 compiler options are set and the compiler target is x86 or x64. Otherwise, undefined.
> ...

So today we require SSE2, and not AVX on Windows. On Linux we do ship with AVX enabled, because there we compile with GCC and it produces portable binaries.

Searching for "illegal instruction" Windows gives only a couple of issues, one that was fixed and the others that were OpenBLAS or VM related. No regular report ever of AVX instructions sneaking into our Windows wheels.

Hope this helps; additions and corrections very welcome!

@rgommers
Member

Re 32-bit Linux: there are a number of issues related to VM use where the detected CPU instructions are wrong. On CI we could perhaps work around that by explicitly disabling AVX512? (I haven't thought about that suggestion very hard; whatever works, I'd say - 32-bit performance optimizations aren't very interesting.)

@mattip
Member

mattip commented Mar 28, 2019

The remaining failures are only on linux32 (which does not have any sse/avx support) and when setting NPY_RELAXED_STRIDES_CHECKING=0 on linux64. Windows passes CI tests.

@r-devulap
Member Author

I was resolving a merge conflict on the release doc, and it seems two tests failed with "The install of mingw was NOT successful." I think this has nothing to do with my patch.

@tylerjereddy
Contributor

Looks like a fluke connection failure--I've restarted the job for you

@r-devulap
Member Author

> Looks like a fluke connection failure--I've restarted the job for you

thank you, they all pass now. any word on merging this patch yet?

@mattip
Member

mattip commented Apr 11, 2019

Do we have any tests already like the ULP test you describe? It would be nice to add the ULP test code as a C function to numpy/core/src/umath/_umath_tests.c.src, and a test to call the function for a vector of a few choice float32 values in numpy/core/tests/test_ufunc.py. I think it would be too slow to check the entire range.

At some point we could add "correct value" testing for numpy math functions, much as Python tests cmath using a file with expected results generated by MPFR.

@r-devulap
Member Author

Could you please elaborate on the purpose of such a test? Different algorithms for transcendental functions will not necessarily produce the exact same output, so we can't check for a "correct value". Perhaps we can check if they are within a certain acceptable threshold, but that raises two questions: (1) what threshold to use? and (2) which float32 values should we validate in the test?

@mattip
Member

mattip commented Apr 12, 2019

> (1) what threshold to use? and (2) which float32 values should we validate in the test?

The threshold is TBD; for instance, in the test I linked to, log is given a larger tolerance than the other tests. The values used to test cmath are the ones known to be problematic: subnormals, values near cutoffs in the algorithms, mantissas near 0 or near the max, -0, ...

It is a lot of work to create the "expected" results. I started an exchange a few years ago on how to generate the file for 64-bit complex numbers and 32-bit floats; maybe today it would be easier.

@r-devulap
Member Author

sounds good. I am almost done with another patch vectorizing sine and cosine. Would you mind if I submit the ULP test as part of that patch and we merge this as is?

@mattip mattip removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Apr 14, 2019
This commit implements vectorized single precision exponential and
natural log using AVX2 and AVX512.

Accuracy:

| Function | Max ULP Error | Max Relative Error |
|----------|---------------|--------------------|
| np.exp   | 2.52          | 2.1E-07            |
| np.log   | 3.83          | 2.4E-07            |

Performance:

(1) Micro-benchmarks: we measured the execution time of np.exp and np.log using the timeit package in Python. Each function is executed 1000 times, and this is repeated 100 times. The standard deviation across runs was less than 2% of the mean and is therefore not included in the data. The vectorized implementation was up to 7.6x faster than the scalar version.

| Function | NumPy1.16 | AVX2   | AVX512 | AVX2 speedup | AVX512 speedup |
| -------- | --------- | ------ | ------ | ------------ | -------------- |
| np.exp   | 0.395s    | 0.112s | 0.055s | 3.56x        | 7.25x          |
| np.log   | 0.456s    | 0.147s | 0.059s | 3.10x        | 7.64x          |

(2) Logistic regression: exp and log are heavily used in training neural networks (as part of the sigmoid activation function and the loss function, respectively). This patch significantly speeds up training a logistic regression model. As an example, we measured how long it takes to train a model with 15 features using 1000 training data points. We observed a 2x speedup in training the model to a loss function error < 10E-04.

| Function       | NumPy1.16  | AVX2   | AVX512 | AVX2 speedup | AVX512 speedup |
| -------------- | ---------- | ------ | ------ | ------------ | -------------- |
| logistic.train | 121.0s     | 75.02s | 60.60s | 1.61x        | 2.02x          |
1) Got rid of the @isa@_cmp_mask helper function, since clang expects the compare operator in _mm@vsize@_cmp_ps to be a compile-time constant

2) Converted all helper functions to static functions
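
For illustration, the constraint behind change (1): the predicate argument of _mm256_cmp_ps must be a compile-time immediate, so the literal has to appear at each call site. A sketch (the helper name is mine):

```c
#include <immintrin.h>

__attribute__((target("avx2")))
static __m256
avx2_less_than(__m256 x, __m256 y)
{
    /* the predicate must be a compile-time constant; passing it through a
     * runtime function parameter is what clang rejected */
    return _mm256_cmp_ps(x, y, _CMP_LT_OQ);  /* ordered, quiet x < y */
}
```
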
clang 6.0 fails to compile this code:

__asm__ __volatile__ (
    "vpaddd %zmm1, %zmm2, %zmm3\n\t"
);

Note that this is a known issue in clang 6.0; clang 7.0 and gcc do not have this problem. The failure prevents the HAVE_LINK_AVX512F flag from being set, and hence the AVX512F versions of exp and log don't get built. If AVX512F is then detected at runtime, instead of choosing to run the AVX2 version, it will end up running the scalar version.
1) use __builtin_cpu_supports("avx512f") only for gcc version >= 5

2) Introduced two new macros, HAVE_ATTRIBUTE_TARGET_@ISA@_WITH_INTRINSICS, to ensure the compiler can compile functions that use intrinsics and carry the avx2/avx512f target attributes
Added a missing NPY_HAVE_SSE2_INTRINSICS guard
@r-devulap
Member Author

Hi, just checking in on the status of the patch. Let me know if there is anything else you need from me.

@mattip mattip merged commit cd2e52c into numpy:master Apr 20, 2019
@mattip
Member

mattip commented Apr 20, 2019

Thanks @r-devulap

@seberg
Member

seberg commented Jul 5, 2019

I am not really worried, but I stumbled on this looking up recent speed improvements in NumPy. The speedups here probably mean a slightly larger error on the functions (compared to non-vectorized/fastmath versions)? At what point do we worry about such precision changes?

@mattip
Member

mattip commented Jul 5, 2019

xref #13515

@r-devulap
Member Author

@seberg for comparison against libm (scalar) and libmvec (glibc's vector implementation):

| Function | libm max ULP | libmvec max ULP | this patch max ULP |
|----------|--------------|-----------------|--------------------|
| sine     | 1.19         | 2.47            | 1.49               |
| cosine   | 1.19         | 2.39            | 1.49               |
| exp      | 0.51         | 2.93            | 2.52               |
| log      | 0.51         | 3.96            | 3.83               |
