Include a fast floating point parser #19708

Closed
anntzer opened this issue Aug 18, 2021 · 1 comment
Comments

@anntzer
Contributor

anntzer commented Aug 18, 2021

Feature

I have recently posted a series of PRs (#19599, #19601, #19608, #19598, #19610, #19620, #19609, #19618, #19687) which together speed up loadtxt() by up to 4x when parsing simple but common (specifically, entirely numeric) text files (see the benchmarks in #19687, for example), solely by removing a lot of Python overhead. I have yet another patch in the works that should yield a further sizeable improvement. After including that last patch, when parsing e.g. 2 columns of floats, I find (if my profiling is correct...) that roughly 30% of the runtime is spent in https://docs.python.org/3/c-api/conversion.html#c.PyOS_string_to_double, which numpy ultimately uses to parse floats. Therefore, it may make sense to consider adopting a faster floating point parser (e.g., if https://lemire.me/blog/2020/03/10/fast-float-parsing-in-practice/ really provides a 10x speedup, then we would cut nearly another 30% off the runtime). Of course, we'd need to fall back to PyOS_string_to_double whenever the fast parser fails, in order to parse "weird" formats supported by Python (e.g., underscores in literals), but those should be relatively uncommon. Also, if one day loadtxt() does get fully rewritten in C (getting rid of the Python overhead that comprises a decent part of the remaining 70%), the speedup on floating point parsing would be even more relevant, so adopting the fast parser would not be wasted work. In any case, I am posting this here because I don't intend to do that work right now :-), but it seems to be a relatively self-contained project, so others may be interested...
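
To make the "fast parser with fallback" idea concrete, here is a minimal Python sketch. It is not the Lemire algorithm and not numpy code: it uses the much simpler classic exact fast path (a mantissa below 2**53 scaled by a power of ten up to 10**22, both exactly representable as doubles), and it uses `float()` as a stand-in for PyOS_string_to_double. The name `parse_float` and the regular expression are hypothetical, chosen only for illustration.

```python
import re

# 10**0 .. 10**22 are all exactly representable as IEEE-754 doubles.
_EXACT_POW10 = [float(10 ** k) for k in range(23)]

# Plain ASCII decimal literal: sign, digits, optional fraction, optional exponent.
_SIMPLE = re.compile(r"([+-]?)([0-9]+)(?:\.([0-9]*))?(?:[eE]([+-]?[0-9]+))?\Z")


def parse_float(text):
    """Parse text as a float, trying an exact fast path first.

    Anything the fast path does not recognise (underscores, "nan",
    surrounding whitespace, very long mantissas, large exponents)
    is handed to float(), standing in here for PyOS_string_to_double.
    """
    m = _SIMPLE.match(text)
    if m is not None:
        sign, int_part, frac_part, exp_part = m.groups()
        frac_part = frac_part or ""
        mantissa = int(int_part + frac_part)
        exponent = int(exp_part or 0) - len(frac_part)
        # Both operands are exact, so one correctly rounded multiply or
        # divide yields the correctly rounded result.
        if mantissa < 2 ** 53 and -22 <= exponent <= 22:
            value = float(mantissa)
            if exponent >= 0:
                value *= _EXACT_POW10[exponent]
            else:
                value /= _EXACT_POW10[-exponent]
            return -value if sign == "-" else value
    # Slow but fully general path.
    return float(text)
```

The point of the design is that the fallback keeps behaviour identical for every input: the fast path only ever handles literals for which it can guarantee the same result as the general parser.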

Alternatively, one could lobby the CPython core devs to directly base PyOS_string_to_double on the Lemire parser, but see https://bugs.python.org/issue41310. Also of note are the pandas float parsers (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/parser/tokenizer.c), although I would (personally) absolutely not want to use an approximate float parser in loadtxt().

@seberg
Member

seberg commented Feb 8, 2022

Happy if someone wants to reopen this, but I am going to be blunt and close this, since I doubt that it is a great TODO (i.e. if someone wants to push this, they also need to be prepared to show that it is actually worthwhile).

My main reason is that all of those "huge" speedups seem to be timed for full-precision floats. The parser included in Python only does the complicated work for inputs with 12 or more decimal digits (I did not check the exponent logic, though). If you parse short floats with only 11 decimal digits, the speed difference is much smaller.
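
One way to check this claim on a given machine is a small micro-benchmark. The sketch below is an assumption-laden illustration, not a measurement from this thread: it times Python's built-in `float()` (which goes through the same underlying parser) on an 11-digit and a 17-digit literal. The helper name `time_parse` and the sample literals are made up for the example, and no particular result is implied.

```python
import timeit


def time_parse(literal, number=200_000):
    """Return the best-of-5 time in seconds for one float(literal) call."""
    timer = timeit.Timer("float(s)", globals={"s": literal})
    return min(timer.repeat(repeat=5, number=number)) / number


if __name__ == "__main__":
    short = "0.12345678901"        # 11 significant digits
    full = "0.12345678901234567"   # 17 significant digits
    for label, literal in [("short", short), ("full", full)]:
        nanoseconds = time_parse(literal) * 1e9
        print(f"{label:>5}: {nanoseconds:.1f} ns per parse")
```

Any faster parser would need to be timed on the same two kinds of input to show whether the advertised speedup survives for short literals.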

So yes, the float parsing is still a huge chunk, but I am not convinced it is nearly as much of a low-hanging and worthwhile fruit as it seems at first sight.
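
For a rough sense of scale, Amdahl's law gives the overall gain from speeding up only the float-parsing share. The sketch below takes the ~30% share reported at the top of this issue as given; the 10x figure is the blog post's headline number, and the 2x figure is purely an illustrative assumption for short floats, not something measured here.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: total speedup when `fraction` of the runtime
    is made `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


if __name__ == "__main__":
    # 30% of runtime in float parsing, parser 10x faster.
    print(f"{overall_speedup(0.30, 10.0):.2f}x")  # prints 1.37x
    # Same share, but only a (hypothetical) 2x faster parser.
    print(f"{overall_speedup(0.30, 2.0):.2f}x")   # prints 1.18x
```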

@seberg seberg closed this as completed Feb 8, 2022