Skip to content

GH-100485: Create an alternative code path when an accurate fma() implementation is not available #101567

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 4, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions Lib/test/test_math.py
Original file line number Diff line number Diff line change
Expand Up @@ -1450,6 +1450,11 @@ def Trial(dotfunc, c, n):
n = 20 # Length of vectors
c = 1e30 # Target condition number

# If the following test fails, it means that the C math library
# implementation of fma() is not compliant with the C99 standard
# and is inaccurate. To solve this problem, make a new build
# with the symbol UNRELIABLE_FMA defined. That will enable a
# slower but accurate code path that avoids the fma() call.
relative_err = median(Trial(math.sumprod, c, n) for i in range(times))
self.assertLess(relative_err, 1e-16)

Expand Down
43 changes: 43 additions & 0 deletions Modules/mathmodule.c
Original file line number Diff line number Diff line change
Expand Up @@ -2851,6 +2851,8 @@ dl_sum(double a, double b)
return (DoubleLength) {x, y};
}

#ifndef UNRELIABLE_FMA

static DoubleLength
dl_mul(double x, double y)
{
Expand All @@ -2860,6 +2862,47 @@ dl_mul(double x, double y)
return (DoubleLength) {z, zz};
}

#else

/*
The default implementation of dl_mul() depends on the C math library
having an accurate fma() function as required by § 7.12.13.1 of the
C99 standard.

The UNRELIABLE_FMA option is provided as a slower but accurate
alternative for builds where the fma() function is found wanting.
The speed penalty may be modest (17% slower on an Apple M1 Max),
so don't hesitate to enable this build option.

The algorithms are from the T. J. Dekker paper:
A Floating-Point Technique for Extending the Available Precision
https://csclub.uwaterloo.ca/~pbarfuss/dekker1971.pdf
*/

static DoubleLength
dl_split(double x) {
// Dekker (5.5) and (5.6).
double t = x * 134217729.0; // Veltkamp constant = 2.0 ** 27 + 1
double hi = t - (t - x);
double lo = x - hi;
return (DoubleLength) {hi, lo};
}

static DoubleLength
dl_mul(double x, double y)
{
// Dekker (5.12) and mul12()
DoubleLength xx = dl_split(x);
DoubleLength yy = dl_split(y);
double p = xx.hi * yy.hi;
double q = xx.hi * yy.lo + xx.lo * yy.hi;
double z = p + q;
double zz = p - z + q + xx.lo * yy.lo;
return (DoubleLength) {z, zz};
}

#endif

typedef struct { double hi; double lo; double tiny; } TripleLength;

static const TripleLength tl_zero = {0.0, 0.0, 0.0};
Expand Down