Skip to content

py/mkrules.mk: workaround fused multiply-add inaccuracy #5995

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

py/mkrules.mk: workaround fused multiply-add inaccuracy #5995

wants to merge 1 commit into from

Conversation

yangfl
Copy link
Contributor

@yangfl yangfl commented May 1, 2020

On arm64, print(float('.' + '9' * 70)) resulted in 1.000000000000001 rather that 1.0.

This is because

dec_val = 10 * dec_val + dig;

was converted into a single fused multiply-add instruction (fmadd) rather than two (mulsd and addsd on x86), thus caused a slightly different rounding loss when dec_val is about to overflow.

The parsing process of '.' + '9' * 70 would look like this on x86

10000000000000003053453312.000000 45208b2a2c280292 -25
100000000000000021944598528.000000 4554adf4b7320336 -26
1000000000000000288165462016.000000 4589d971e4fe8404 -27

and on arm64

10000000000000003053453312.000000 45208b2a2c280292 -25
100000000000000039124467712.000000 4554adf4b7320337 -26
1000000000000000425604415488.000000 4589d971e4fe8405 -27

Detect the overflow earily to avoid the rounding loss.

@stinos
Copy link
Contributor

stinos commented May 1, 2020

Can you fix codestyle (run tools/codeformat.py), add tests for this exact behavior, and review why the current float_parse tests fail on some platforms?

On arm64, `print(float('.' + '9' * 70))` resulted in `1.000000000000001`
rather that `1.0`.

This is because

  dec_val = 10 * dec_val + dig;

was converted into a single fused multiply-add instruction (fmadd) rather than
two (mulsd and addsd on x86), thus caused a slightly different rounding loss
when `dec_val` is about to overflow.

The parsing process of `'.' + '9' * 70` would look like this on x86

  dec_val                      *(uint64_t*)&dec_val   exp_extra
  ...
  10000000000000003053453312.000000 45208b2a2c280292 -25
  100000000000000021944598528.000000 4554adf4b7320336 -26
  1000000000000000288165462016.000000 4589d971e4fe8404 -27
  ...

and on arm64

  ...
  10000000000000003053453312.000000 45208b2a2c280292 -25
  100000000000000039124467712.000000 4554adf4b7320337 -26
  1000000000000000425604415488.000000 4589d971e4fe8405 -27
  ...

Note the difference when `exp_extra` = -26.

Unfortunately there are no easy ways to fix this problem [1], furthermore GCC
does not recognize `#pragma STDC FP_CONTRACT OFF`, while it does enable
FP_CONTRACT as soon as -O2 (or higher). The only feasible solution appears to
me is only `-ffp-contract=off`.

[1] https://stackoverflow.com/a/42134261
[2] https://lists.freedesktop.org/archives/libreoffice/2017-December/079046.html
@yangfl yangfl changed the title py/parsenum.c: use safe way to parse decimal py/mkrules.mk: workaround fused multiply-add inaccuracy May 1, 2020
@yangfl
Copy link
Contributor Author

yangfl commented May 1, 2020

Come up with another solution. Now tests/float/float_parse.py passes on arm64.

@dlech
Copy link
Contributor

dlech commented May 1, 2020

dec_val = 10 * dec_val + dig;

Just for cross-reference, this looks like the same code that is causing #5831

@dpgeorge
Copy link
Member

dpgeorge commented May 3, 2020

Thanks for the patch, but I don't think this is the right way to fix the problem. The mp_parse_num_decimal() function is not perfect (as described by a few other issues/PRs) and would probably need a substantial rewrite to make it so.

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label May 3, 2020
@dpgeorge
Copy link
Member

dpgeorge commented May 3, 2020

If you want to work around the issue for ARM64 builds, then it's possible to pass CFLAGS_EXTRA=-ffp-contract=off when building with make.

@yangfl
Copy link
Contributor Author

yangfl commented May 3, 2020

Thanks and I see. But I think I'd rather add that flag in makefile to prevent those problems before someone does actually come up with a bugfree mp_parse_num_decimal which won't be easy as described in the links.

@dpgeorge
Copy link
Member

dpgeorge commented May 3, 2020

But I think I'd rather add that flag in makefile to prevent those problems

These flags put in py/mkrules.mk affect all systems. If anything the flags could go in ports/unix/Makefile and be selective based on the architecture (since it's currently ok for eg x86).

But, do these flags fix more cases of floating port parsing then they break? Previous attempts to fix the mp_parse_num_decimal() have made certain numbers parse correctly but break parsing of numbers which used to work.

@yangfl
Copy link
Contributor Author

yangfl commented May 3, 2020

At lease it does not break any tests under tests/. It should be enabled for all systems since there may be more fused operations hidden in the code, which are both hard to point out and somewhat impossible to avoid, and x86 SIMD has fused instructions. Anyway it's all up to you and I'd happy to see any real fix.

@dpgeorge
Copy link
Member

dpgeorge commented May 3, 2020

It should be enabled for all systems since there may be more fused operations hidden in the code, which are both hard to point out and somewhat impossible to avoid,

But are these operations bad? Some systems may be relying on them in other code (eg FFT) and having the option in py/mkrules.mk means it disables them for all code, including 3rd party libraries.

The usual case is that it is up to a port (eg unix port) to apply relevant compiler flags depending on the target system/arch/OS/etc.

To make progress here it'd be good to be able to reproduce the issue, eg using a virtualbox image or qemu.

@yangfl
Copy link
Contributor Author

yangfl commented May 3, 2020

#include <math.h>
#include <stdio.h>

void __attribute__((optimize("-O0"))) madd1(double a[3][2], double * restrict d) {
  for (int i = 0; i < 2; i++) {
    d[i] = a[0][i] * a[1][i] + a[2][i];
  }
}

void __attribute__((optimize("-O3"), target("fma"))) madd2(double a[3][2], double * restrict d) {
  for (int i = 0; i < 2; i++) {
    d[i] = a[0][i] * a[1][i] + a[2][i];
  }
}

int __attribute__((optimize("-O0"))) main(){
  double a[3][2], d1[2], d2[2];
  a[1][0] = 10;
  a[2][0] = 9;
  d2[0] = 9;
  for (int j = 2; j < 60; j++) {
    a[0][0] = d2[0];
    madd1(a, d1);
    madd2(a, d2);
    printf("%s %.20f %.20f\n", d1[0] == d2[0] ? "yay" : "nah", d1[0] / pow(10, j), d2[0] / pow(10, j));
  }
}

Abused SIMD to demonstrate FMA inaccuracy. But since floating-point arithmetic comes with more or less error, it's merely one's preference to choose between speed or cross-platform consistency.

@dpgeorge
Copy link
Member

dpgeorge commented May 6, 2020

I was able to reproduce this issue on Aarch64 (RPi 3 Model B+ running in 64-bit mode).

But I was also able to find a floating point number that is parsed correctly with fp-contraction enabled, but then incorrectly when fp-contraction is disabled. Ie the commit here fixes some floats but breaks others.

Eg the test:

import array
print(bytes(array.array('d', [float("." + "9" * 70)])))
print(bytes(array.array('d', [float("52306490527514614E49")])))

run under CPython 3.8.2 gives:

b'\x00\x00\x00\x00\x00\x00\xf0?'
b'\xdf \xd6[\x00\xde\x93M'

under MicroPython unix port (current master) which does have fp contraction:

b'\x04\x00\x00\x00\x00\x00\xf0?'
b'\xdf \xd6[\x00\xde\x93M'

(note that first is wrong, second is right) and under MicroPython unix port (current master) with fp contraction turned off:

b'\x00\x00\x00\x00\x00\x00\xf0?'
b'\xde \xd6[\x00\xde\x93M'

(note that first is now right, but second is now wrong).

The function mp_parse_num_decimal() is (I think...) always correct up to the least-significant-bit of the mantissa (bit 0). But see in the test above that the error for the first number is wrong in bit 2 of the mantissa (there's a 0x04 instead of 0x00 in the first byte). With fp-contraction turned off the case that was working that is now wrong is only wrong by bit 0. So I can see how turning fp-contraction off might be the right thing to do, to reduce error to bit 0 in all cases. More tests would need to be run to see if that's the case.

@dpgeorge
Copy link
Member

For an alternative to this, which attempts to fix the FP parsing routine, see #6024 . This has not yet been tested on Aarch64 though.

@dpgeorge
Copy link
Member

As mentioned above, this PR fixes some float parsing instances but breaks others. Better to work on a solution like #6024.

@dpgeorge dpgeorge closed this May 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
py-core Relates to py/ directory in source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants