Skip to content

BUG: Remove check requiring natural alignment of float/double to run AVX code #15615

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 21, 2020

Conversation

r-devulap
Copy link
Member

I am not entirely sure what is causing the CI failure in #15610. But the test failure seems to be because of a small ULP error (Max relative difference: 1.1176388e-16
which should be < 1ULP). This patch uses assert_array_max_ulp instead of assert_equal which should hopefully fix the test failure.

@mattip
Copy link
Member

mattip commented Feb 20, 2020

What is this part of the test meant to check? The stated purpose is to check the gather properties of the AVX loops. So I think this PR is avoiding the problem: it should only check strides that will end up using the AVX semantics. If you are hitting the glibc implementation, I think you are not testing what you want to test. Do you agree this strided part of the test is somehow escaping the AVX implementation? We will eventually get a bug report on some obscure platform that needs 2 or 3 ULP, and will spend time searching for why they report tests failing, but we cannot reproduce.

This points out the problem with having architecture-specific code: we need to be very careful with our tests. NumPy supports many obscure platforms and antiquated systems, and the developers on those platforms want to be able to run numpy.test() secure in the knowledge that it should pass. I would prefer to spend the time now to hone the tests rather than later get issues that are very hard to reproduce.

@mattip
Copy link
Member

mattip commented Feb 20, 2020

The gather part of the code does not depend on the ufunc itself, so you really don't need to check all the ufuncs here. Perhaps a better way to test this part of the code would be to add another test ufunc that uses a _mm256 copy intrinsic, and the non-AVX passthrough (the "glibc version") does nothing. Then set the output to -1, and the input to range(n). If the output remains -1, the AVX loop was not called. Otherwise, you can check that the output is range(n) (the order must be correct). We have _umath_tests.c.src for things like this. Such a test would also provide a way to check that AVX code is being called, so we can check other parts of the ufunc machinery: loop function substitution, compilation issues and more.

Edit: be more specific in the description

@r-devulap
Copy link
Member Author

For reciprocal, no matter what the value of stride jj is (negative. positive or zero), it is meant to follow the AVX route via the gather instruction which I can confirm it does on an SkylakeX with AVX-512 and Skylake client with AVX2/FMA. The test is meant to cover all these scenarios with strides = np.array([-4,-3,-2,-1,1,2,3,4])

But I think I might have a clue as to why IS_OUTPUT_BLOCKABLE_UNARY fails for some strides as you say in #15610 (comment). One of the condition for that macro is:

(npy_is_aligned(args[0], esize) && npy_is_aligned(args[1], esize)) which basically requires that doubles should be 8 byte aligned. This could likely to be the culprit. gcc gaurantees 8-byte alignment only on x86_64 and not on 32-bit systems (see -malign-double section of https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). This also can explain why the failure is only intermittent (sometime its 8 byte aligned and other times its not).

I also realized this condition is completely unnecessary. AVX code uses unaligned loads and does not care what the alignment is. So, perhaps I could get rid of this condition and this could be fixed?

@mattip
Copy link
Member

mattip commented Feb 20, 2020

Sounds reasonable for a fix, but we still should have some kind of test that the avx path is being followed

@r-devulap
Copy link
Member Author

but we still should have some kind of test that the avx path is being followed

Doesn't this #15558 exactly address this problem once Universal Intrinsics patch is merged?

@r-devulap r-devulap force-pushed the reciprocal-avx-CI-fail branch from 7221c45 to 519f7ac Compare February 20, 2020 21:42
@r-devulap r-devulap changed the title TST: Use assert_array_max_ulp instead of assert_equal for strided reciprocal test BUG: Remove check requiring natural alignment of float/double to run AVX code Feb 20, 2020
In a x86-32 bit system, doubles need not be naturally aligned to 8 Byte
boundary (see: -malign-double section of
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). Having this check
meant it ran different code paths (AVX v/s scalar) depending on the
alignment of data which leads to different results and test failing
intermittently. AVX code uses un-aligned loads and this check is
unnecessary to begin with.
@r-devulap
Copy link
Member Author

Phew! This was a fun exercise, but I think I am reasonably certain this should fix the failing test. The fact that kept bugging me the most was that the vector and scalar code gave different answers. Reciprocal is ultimately just a divide (1/arr), so the vector divide instruction vdivpd and scalar divide instruction divsd should provide the exact same answer (0 ULP) and there is no way the answers can mis-match.

But turns out for some odd reason, gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5) in the manylinux2010 docker decides to use the x87 fdivl instruction (extended precision) instead of the divsd (double precision) and this doesn't happen with the gcc on Ubuntu (which uses the divsd instruction like other sane people). So that explains the output mismatch and I can sleep peacefully again :)

@mattip mattip merged commit 9b1acd4 into numpy:master Feb 21, 2020
@mattip
Copy link
Member

mattip commented Feb 21, 2020

Thanks @r-devulap

@mattip
Copy link
Member

mattip commented Feb 21, 2020

MacPython/numpy-wheels#73 is still failing

@rgommers rgommers added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Jul 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug 05 - Testing component: numpy._core component: SIMD Issues in SIMD (fast instruction sets) code or machinery
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants