
py/parsenum: Implement exact float parsing using integer mpz (WIP) #6024


Open
wants to merge 3 commits into master

Conversation

dpgeorge
Member

This PR attempts to implement exact parsing of floats in mp_parse_num_decimal(). There have been quite a few issues related to this, and parsing floats correctly is important, so I thought it'd be worth trying to fix this properly.

Other implementations I could find (eg musl's) are pretty complex and use a lot of stack. The implementation here is designed and written from scratch (though it's probably not original) and reuses the existing MicroPython big-int functions to do the bulk of the conversion.

The point is that all operations during the conversion must be exact (ie no rounding, or else the rounding errors will compound), except that as an optimisation the very last step can use the FP hardware because it'll round in the correct way.

The idea is to parse the input number as a big-int and then apply the exponent, also using big-int arithmetic. The trick is that the exponent 10**e can be factored into (2*5)**e, of which the 2**e part can be done exactly in hardware because 2 is the natural base of the FP representation. So only the 5**e calculation needs to be done using big-ints. If e is negative then the big-int is pre-multiplied by a large enough power of two (namely 8**|e|) so it can then be divided by 5**|e| without loss.
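As a rough Python sketch of this factoring (an illustration only, not the PR's C code; the 64-bit guard count and round-to-nearest division below are a simplification standing in for the PR's 8**|e| pre-multiply plus normalisation step):

```python
from math import ldexp

def parse_exact(digits: int, e: int) -> float:
    """Sketch of exact decimal-to-float: digits * 10**e, with 10**e = 2**e * 5**e."""
    if e >= 0:
        # digits * 5**e is exact big-int arithmetic; ldexp applies the
        # 2**e part exactly (it only adjusts the FP exponent)
        return ldexp(float(digits * 5**e), e)
    n = -e
    # Scale up before dividing by 5**n so no bits are lost; the extra
    # guard bits and rounded division are a simplification of the PR's
    # 8**|e| pre-multiplication and its normalisation step.
    shift = 3 * n + 64
    q, r = divmod(digits << shift, 5**n)
    if 2 * r >= 5**n:          # round the quotient to nearest
        q += 1
    # value = q * 2**-shift * 2**-n
    return ldexp(float(q), -shift - n)

assert parse_exact(74, 46) == float("74e46")
assert parse_exact(1, -1) == 0.1
```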

The entire parsing could be done without any FP hardware, ie just using big-ints to construct the bits in the floating point representation (mantissa, exponent, sign bit). But it seems OK to do the last bit using standard float operations (converting the big-int to a float then adjusting by the power-of-2 exponent) and this makes the code a bit simpler (reuses existing functions, like ldexp).

Tested against a set of known difficult numbers (from CPython test suite) and it parses all of them correctly (in double precision mode).

TODO/outstanding:

  • do more tests
  • test with 32-bit floats
  • don't allocate heap memory for temporary mpz's, use the stack instead
  • benchmark to see if it has acceptable performance

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label May 10, 2020
@dpgeorge
Member Author

For an example of a number that did not parse correctly before this PR, but parses correctly now, consider 74e46. This used to parse to b'\xee\xad\xa0\xec\xd73\xe0I' but now parses correctly to b'\xef\xad\xa0\xec\xd73\xe0I' (note LSB). And getting this correct is hard because 74 can be represented exactly in a double, so the error is introduced by applying the exponent, ie the multiplication by 10**46. This exponent cannot be represented exactly on its own so pow(10, 46) is inexact, and then doing the multiplication 74 * pow(10, 46) propagates the error.

The way this number is parsed in this PR is:

  • 74 parsed as a big-int, with exponent 46 parsed as a machine int
  • 5**exp computed as a big-int
  • 74 * 5**exp computed as a big-int (still exact)
  • normalisation of 74 * 5**exp which does correct rounding (this step is not strictly needed though)
  • convert 74 * 5**exp to a float, no longer exact but as close as possible to the true value (due to IEEE properties of FP)
  • multiply by 2**exp via ldexp(float(74 * 5**exp), exp), which is an exact FP operation (it just adjusts the FP exponent of 74 * 5**exp)
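For reference, the steps above can be replayed in CPython (whose parser is correctly rounded) to confirm the expected byte string:

```python
# Re-running the 74e46 steps above in CPython to check the result.
import struct
from math import ldexp

exp = 46
p = 5**exp              # 5**exp computed as a big int
n = 74 * p              # 74 * 5**exp, still exact
f = float(n)            # the first (and only) rounding step
result = ldexp(f, exp)  # exact: only the FP exponent is adjusted

# matches CPython's own correctly rounded parse, byte for byte
assert struct.pack("<d", result) == struct.pack("<d", float("74e46"))
assert struct.pack("<d", result) == b"\xef\xad\xa0\xec\xd73\xe0I"
```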

@stinos
Contributor

stinos commented May 13, 2020

This looks really good. I didn't check the code in detail, but the logic is sound and if tests pass the code should be correct :) With that in mind: it's probably worth adding as many 'special' cases as needed to cover a good range. Then if the performance isn't abysmal this is the way forward I think.

@dpgeorge
Member Author

With that in mind: it's probably worth adding as many 'special' cases as needed to cover a good range.

Not sure what you mean by this, which "special cases"?

@stinos
Contributor

stinos commented May 13, 2020

The ones which don't parse correctly now like in your previous comment. And picked across the whole range of doubles, i.e. also very large and very small ones.

@dpgeorge
Member Author

Aah, I see, you mean add specific tests. Yes indeed.

For 32-bit parsing it should be possible to test every value (although not as part of the test suite, it'd take too long).
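Such an exhaustive check can be sketched in CPython by walking the 32-bit patterns (a hypothetical helper, not part of the PR; only a few samples are run here since the full 2**32 sweep takes hours):

```python
# Sketch: exhaustively verifying single-precision parsing by bit pattern.
import struct

def roundtrip_ok(bits: int) -> bool:
    """Unpack a float32 bit pattern, repr it, re-parse, repack, compare."""
    (f,) = struct.unpack("<f", struct.pack("<I", bits))
    if f != f:                  # NaN payloads: identity doesn't apply
        return True
    # repr() of the (exactly converted) double gives enough digits to
    # recover the original float32 value on re-parse
    (g,) = struct.unpack("<f", struct.pack("<f", float(repr(f))))
    return g == f

# A full sweep is `for bits in range(2**32)`; sample a few patterns here:
# zero, smallest subnormal, largest subnormal, smallest normal, 1.0,
# largest finite, +inf.
for bits in (0x00000000, 0x00000001, 0x007FFFFF, 0x00800000,
             0x3F800000, 0x7F7FFFFF, 0x7F800000):
    assert roundtrip_ok(bits)
```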

dpgeorge added 3 commits May 16, 2020 16:36
Float parsing should now be exact.  Costs about 200 bytes on stm32.

TODO:
- test more
- don't allocate heap memory, use the stack instead
- benchmark to see if it's acceptable
@dpgeorge dpgeorge force-pushed the py-parsenum-exact-float branch from 489efde to 1c5238c Compare May 16, 2020 07:38
@ddiminnie

ddiminnie commented Jun 5, 2020

If you would like some additional 'special case' double-precision tests, feel free to use these (or feel free to bin them - won't hurt my feelings :-) )

(All cases refer to 17-digits-of-precision decimal inputs (sans trailing zeros), with expected results expressed as (little-endian) IEEE-754 binary64 values.)

Overflow/largest double boundary:
1.7976931348623159e308 (should overflow, 0x000000000000f07f )
1.7976931348623158e308 (should yield the max double, 0xffffffffffffef7f )

Normalized/denormalized double boundary:
2.2250738585072012e-308 (should yield smallest normalized double, 0x0000000000001000 )
2.2250738585072011e-308 (should yield largest denormalized double, 0xffffffffffff0f00 )

Shortest (up to) 17-digit input that converts to smallest denormalized double:
5e-324 (should yield smallest denormalized double, 0x0100000000000000 )

Closest 17-digit input to the smallest denormalized double:
4.9406564584124654e-324 (should yield smallest denormalized double, 0x0100000000000000 )

The next boundary will depend on how good the ldexp implementation is on the target platform:
Smallest denormalized double/underflow boundary:

2.4703282292062328e-324 (should yield smallest denormalized double, 0x0100000000000000 )
(Note that this value is greater than 2**-1075 and therefore should round up. 64-bit CPython 3.7.5 on win32 gets this right. Your mileage may vary, since the 54 most significant bits of the result are 0b1.00000000000000000000000000000000000000000000000000000 x 2**-1075.)

2.4703282292062327e-324 (should underflow to zero: 0x0000000000000000 )
(Note that this value is less than 2**-1075 and therefore should round down to zero.)
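These cases can be checked directly against CPython's correctly rounded parser with struct; the expected values are the little-endian binary64 byte strings quoted above (the last two cases depend on correct subnormal rounding, which CPython's own dtoa provides):

```python
# ddiminnie's boundary cases, checked against CPython's parser.
import struct

cases = {
    "1.7976931348623159e308": "000000000000f07f",   # overflows to +inf
    "1.7976931348623158e308": "ffffffffffffef7f",   # max finite double
    "2.2250738585072012e-308": "0000000000001000",  # smallest normal
    "2.2250738585072011e-308": "ffffffffffff0f00",  # largest subnormal
    "5e-324": "0100000000000000",                   # smallest subnormal
    "2.4703282292062328e-324": "0100000000000000",  # > 2**-1075, rounds up
    "2.4703282292062327e-324": "0000000000000000",  # < 2**-1075, underflows
}
for text, expected in cases.items():
    assert struct.pack("<d", float(text)) == bytes.fromhex(expected)
```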

@projectgus
Contributor

This is an automated heads-up that we've just merged a Pull Request
that removes the STATIC macro from MicroPython's C API.

See #13763

A search suggests this PR might apply the STATIC macro to some C code. If it
does, then next time you rebase the PR (or merge from master) then you should
please replace all the STATIC keywords with static.

Although this is an automated message, feel free to @-reply to me directly if
you have any questions about this.
