I'm happy if someone wants to reopen this, but I am going to be blunt and close it, since I doubt it is a great TODO item (i.e. whoever wants to push this also needs to be prepared to show that it is actually worthwhile).

My main reason is that all of those "huge" speedups seem to be timed on full-precision floats. The parser included in Python has the characteristic that the complicated (slow) handling only kicks in at 12 or more decimal digits (I did not check the exponent logic, though). If you parse short floats with 11 or fewer decimal digits, the speed difference is much smaller.

So yes, float parsing is still a huge chunk of the runtime, but I am not convinced it is as low-hanging and worthwhile a fruit as it seems at first sight.
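For anyone who wants to quantify that digit-count effect, a minimal embedding sketch (not from this thread; the strings, repetition count, and build command are assumptions) could time PyOS_string_to_double on a short decimal string versus a full 17-digit one:

```cpp
// Build roughly: c++ bench_parse.cpp $(python3-config --includes) $(python3-config --ldflags --embed)
#include <Python.h>
#include <chrono>
#include <cstdio>

// Time repeated parses of one NUL-terminated string; returns mean ns per call.
static double time_parse(const char *s, int reps, double *sink)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; i++) {
        *sink += PyOS_string_to_double(s, NULL, NULL);
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / reps;
}

int main()
{
    Py_Initialize();  // PyOS_string_to_double lives in the CPython runtime
    double sink = 0.0;
    const int reps = 1000000;
    // A short float (few significant digits) vs. a full 17-digit double.
    double t_short = time_parse("1.25", reps, &sink);
    double t_long  = time_parse("0.84551240822557006", reps, &sink);
    std::printf("short: %.1f ns/parse, long: %.1f ns/parse (sink=%g)\n",
                t_short, t_long, sink);
    Py_Finalize();
    return 0;
}
```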
Feature
I have recently posted a series of PRs (#19599, #19601, #19608, #19598, #19610, #19620, #19609, #19618, #19687) which together speed up loadtxt() by up to 4x when parsing simple but common (specifically, entirely numeric) text files (see the benchmarks in #19687, for example), solely by removing a lot of Python overhead. I have yet another patch in the works that should yield a further sizeable improvement.

With that last patch included, when parsing e.g. 2 columns of floats, I find (if my profiling is correct...) that roughly 30% of the runtime is spent in PyOS_string_to_double (https://docs.python.org/3/c-api/conversion.html#c.PyOS_string_to_double), which numpy ultimately uses to parse floats. It may therefore make sense to adopt a faster floating-point parser: if https://lemire.me/blog/2020/03/10/fast-float-parsing-in-practice/ really provides a 10x speedup, we would effectively recover most of that 30%. Of course, we would need to fall back to PyOS_string_to_double when the fast parser fails, in order to handle the "weird" formats Python supports (e.g. underscores in literals), but those should be relatively uncommon.

Also, if loadtxt() is one day fully rewritten in C (getting rid of the Python overhead that makes up a decent part of the remaining 70%), the speedup on floating-point parsing becomes even more relevant, so adopting the fast parser would not be wasted work.

In any case, I am posting this here because I don't intend to do that work right now :-), but it seems to be a relatively self-contained project, so others may be interested...
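To make the fallback idea concrete, here is a minimal sketch (not numpy code; it assumes the C++ fast_float library from the blog post above, and the function name parse_double_fast is made up here): try the fast parser first, and only hand the string to PyOS_string_to_double when the fast path does not consume it.

```cpp
#include <Python.h>                 // PyOS_string_to_double, PyErr_*
#include <cstring>                  // std::strlen
#include <system_error>             // std::errc
#include "fast_float/fast_float.h"  // https://github.com/fastfloat/fast_float

// Parse the NUL-terminated string `s` as a double. Returns 0 on success and
// -1 on failure with a Python exception set. Names and error handling here
// are illustrative, not numpy's actual conversion code.
static int
parse_double_fast(const char *s, double *out)
{
    const char *end = s + std::strlen(s);

    // Fast path: Lemire-style parser for plain decimal/scientific notation.
    auto res = fast_float::from_chars(s, end, *out);
    if (res.ec == std::errc() && res.ptr == end) {
        return 0;
    }

    // Slow path: CPython's parser, for anything the fast path rejects.
    char *parse_end = NULL;
    *out = PyOS_string_to_double(s, &parse_end, NULL);
    if (*out == -1.0 && PyErr_Occurred()) {
        return -1;
    }
    if (parse_end != end) {  // trailing garbage after a valid prefix
        PyErr_Format(PyExc_ValueError,
                     "could not convert string to float: '%s'", s);
        return -1;
    }
    return 0;
}
```

The point of the fallback is that behaviour stays identical to today's parser for everything the fast path rejects, so only the common all-numeric case would change.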
Alternatively, one could lobby the CPython core devs to directly base PyOS_string_to_double on the Lemire parser, but see https://bugs.python.org/issue41310. Also of note are the pandas float parsers (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/parser/tokenizer.c), although I would (personally) absolutely not want to use an approximate float parser in loadtxt().