BUG: Resolve Divide by Zero on Apple silicon + test failures #19926
clang has an optimization bug where a vector that is only partially loaded / stored will get narrowed to only the lanes used, which can be fine in some cases. However, in numpy's `reciprocal` function a partial load is explicitly extended to the full width of the register (filled with '1's) to avoid divide-by-zero. clang's optimization ignores the explicit filling with '1's.

The changes here insert a dummy `volatile` variable. This convinces clang not to narrow the load and ignore the explicit filling of '1's. `volatile` can be expensive since it forces loads / stores to refresh contents whenever the variable is used. numpy has its own template / macro system that'll expand the loop function below for sqrt, absolute, square, and reciprocal. Additionally, the loop can be called on a full array if there's overlap between src and dst. Consequently, we try to limit the scope that we need to apply `volatile`. Intention is it should only be needed when compiling with clang, against Apple arm64, and only for the `reciprocal` function. Moreover, `volatile` is only needed when a vector is partially loaded.

Testing: Beyond fixing the cases mentioned in the GitHub issue, the changes here also resolve several failures in numpy's test suite.

Before:
```
FAILED numpy/core/tests/test_scalarmath.py::TestBaseMath::test_blocked - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_ufunc.py::TestUfuncGenericLoops::test_unary_PyUFunc_O_O_method_full[reciprocal] - AssertionError: FloatingPointError not raised
FAILED numpy/core/tests/test_umath.py::TestPower::test_power_float - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
FAILED numpy/core/tests/test_umath.py::TestAVXUfuncs::test_avx_based_ufunc - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormDouble::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormSingle::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormInt64::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
8 failed, 14759 passed, 204 skipped, 1268 deselected, 34 xfailed in 69.90s (0:01:09)
```

After:
```
FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
1 failed, 14766 passed, 204 skipped, 1268 deselected, 34 xfailed in 70.37s (0:01:10)
```
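The partial-load behavior described in this commit message can be illustrated with a small pure-Python model. This is only a sketch and not numpy's C code: `LANES`, `load_partial`, `reciprocal_lanes`, and `store_partial` are hypothetical names, and Python's `ZeroDivisionError` stands in for the floating-point divide-by-zero exception. It assumes that the padding lanes end up holding zero when the fill is dropped, which is what the reported warning suggests.

```python
LANES = 4  # hypothetical vector width, e.g. four float32 lanes in a 128-bit register


def load_partial(src, start, count, fill):
    """Model a partial vector load: `count` real elements, remaining lanes set to `fill`."""
    lanes = list(src[start:start + count])
    return lanes + [fill] * (LANES - len(lanes))


def reciprocal_lanes(vec):
    """Model a full-width divide: every lane is divided, including the padding lanes."""
    return [1.0 / x for x in vec]


def store_partial(dst, start, count, vec):
    """Model a partial store: only the first `count` lanes are written back."""
    dst[start:start + count] = vec[:count]
```

With a fill of `1.0`, a three-element tail such as `[2.0, 4.0, 8.0]` is padded to `[2.0, 4.0, 8.0, 1.0]` and the divide is harmless, since the padding lane is never stored. With a fill of `0.0`, the same call to `reciprocal_lanes` raises, even though the padding lane is not part of the result. That is the situation the explicit fill with '1's is meant to prevent.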
Enhancement on top of workaround for clang bug in reciprocal (numpy#18555)

Numpy's FP unary loops use a partial load / store on every iteration. The partial load / store helpers each insert a switch statement to know how many elements to handle. This causes a lot of unnecessary branches to be inserted in the loops. The partial load / store is only needed on the final iteration of the loop if it isn't a full vector.

The changes here break out the final iteration to use the partial load / stores. The main loop has been changed to use full load / stores. Additionally, this means we don't need to conditionalize the volatile workaround in the loop.
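The loop restructuring described above can be sketched in pure Python as follows. This is a model under stated assumptions rather than the actual C loop: `LANES` and `reciprocal_contiguous` are hypothetical names, list slices stand in for vector loads and stores, and the input is assumed to be contiguous with no zero elements.

```python
LANES = 4  # hypothetical vector width


def reciprocal_contiguous(src):
    """Reciprocal of every element, processed in LANES-wide chunks."""
    n = len(src)
    dst = [0.0] * n
    i = 0

    # Main loop: full-width loads and stores only, so no per-iteration
    # branch on how many lanes are valid.
    while n - i >= LANES:
        vec = src[i:i + LANES]
        dst[i:i + LANES] = [1.0 / x for x in vec]
        i += LANES

    # Final iteration: at most one partial vector. Pad the unused lanes
    # with 1.0 so the full-width divide cannot hit a zero in the padding.
    remaining = n - i
    if remaining > 0:
        vec = list(src[i:]) + [1.0] * (LANES - remaining)
        out = [1.0 / x for x in vec]
        dst[i:] = out[:remaining]  # partial store: drop the padding lanes

    return dst
```

For a six-element input, the main loop handles the first four elements and the tail handles the last two. The lane-count check then runs once per call instead of once per iteration, and any workaround tied to the partial load only needs to live in the tail.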
It appears to be due to CI using
I am wondering how much effort we should spend on buggy clang (or MacOS math libraries) here, especially if it only affects older versions... Most likely there are far more FPE failures on MacOS anyway and the only reason CI passes is that we don't have systematic unit tests for most functions.
@seberg this
Other than one SciPy kernel panic related to ARPACK, we're in pretty good shape for macOS arm64. The NumPy test suite passes in release mode.
If we can upgrade Clang to the minimum required version in CI that would be fine too. Conda Clang compilers are at 11.1 though, so not much higher than 11.0 as in CI here. If we can keep 11.1 as a base without a large amount of extra effort rather than bumping to 12.0, that would be nice.
Actually that isn't completely correct. With SciPy master + numpy 1.21.1 there are now 101 failures. About 80 of those are BLAS/LAPACK related. With this PR that's down to 89 test failures, so it resolves about half of the non-linalg failures. What's left is similar in nature:
- -ftrapping-math is default enabled for Numpy, but support in clang is mainly for x86_64
- Apple Clang and Clang have different, but overlapping versions
- Non-Apple Clang versions come from looking at when they started supporting -ftrapping-math for x86_64

Testing was done against Apple Clang versions
- v11 / x86_64 - failed previously, now passes (azure failure)
- v12+ / x86_64 - passes before and after
- v13 / arm64 - failed before initial patch, passes after
Latest commit should resolve Azure failures now as well, waiting for CI to finish to confirm.
Thanks @Developer-Ecosystem-Engineering .
FYI, Conda clang 11.1 is equivalent to Apple clang 12. See the versioning at https://en.wikipedia.org/wiki/Xcode
- test_sincos_float32 taken care of in numpy/numpy#20274
- TestF77Mismatch removed upstream in numpy/numpy#20457
- test_unary_PyUFunc_O_O_method_full fixed in 1.22 (numpy/numpy#19926)
Clang has an optimization bug where a vector that is only partially loaded / stored will get narrowed to only the lanes used, which can be fine in some cases. However, in numpy's `reciprocal` function a partial load is explicitly extended to the full width of the register (filled with '1's) to avoid divide-by-zero. clang's optimization ignores the explicit filling with '1's.

The changes here insert a dummy `volatile` variable. This convinces clang not to narrow the load and ignore the explicit filling of '1's. `volatile` can be expensive since it forces loads / stores to refresh contents whenever the variable is used. numpy has its own template / macro system that'll expand the loop function below for sqrt, absolute, square, and reciprocal. Additionally, the loop can be called on a full array if there's overlap between src and dst. Consequently, we try to limit the scope that we need to apply `volatile`. Intention is it should only be needed when compiling with clang, against Apple arm64, and only for the `reciprocal` function. Moreover, `volatile` is only needed when a vector is partially loaded.

Testing:
Fixes #18555 and the changes here also resolve several failures in numpy's test suite.
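Before: 8 failed, 14759 passed, 204 skipped, 1268 deselected, 34 xfailed in 69.90s (0:01:09)
After: 1 failed, 14766 passed, 204 skipped, 1268 deselected, 34 xfailed in 70.37s (0:01:10)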
After addressing this failure, we've added to it (second commit requires the first)
Numpy's FP unary loops use a partial load / store on every iteration. The partial load / store helpers each insert a switch statement to know how many elements to handle. This causes a lot of unnecessary branches to be inserted in the loops. The partial load / store is only needed on the final iteration of the loop if it isn't a full vector.
The changes here break out the final iteration to use the partial load / stores. The main loop has been changed to use full load / stores. Additionally, this means we don't need to conditionalize the volatile workaround in the loop.