PERF: Avoid unnecessary string operations in loadtxt. #19734
Closed
This PR goes on top of #19687 (only the last commit is new), but showcases another speed benefit of special-casing numeric types in loadtxt, so I thought I may as well post it already :-)
When using _IMPLICIT_CONVERTERS, it is actually OK if strings are passed with trailing newlines (so we don't need to strip `"\r\n"`), and comments can be implicitly detected because the converters raise ValueError on them. Therefore, one can use an "approximate" line splitter, which doesn't remove trailing comments and newlines, falling back on the full line splitter if needed. This provides a 10-20% speedup in the case where there are actually no comments in the file (we could instead check the value of the `comments` kwarg, but it defaults to a non-empty value and it seems likely most users will not notice that a large speedup can be achieved by emptying it).
However, if there are actual comments in the file, then recreating the original string from the approximately split one and re-splitting it is very costly (it would incur a >2x slowdown), so we switch back to the full splitter (controlled by a local flag) in that case. Overall, only very short loads (10 rows) that include comments are slowed down by ~10% (likely due to the extra processing on the row with comments).
(To be fully explicit, despite its name, the "approximate" splitter will
never parse incorrect values; it may simply "fail", but we just fall
back to the full/slow splitter in that case.)
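To illustrate the idea, here is a minimal sketch of the fallback logic, not the actual code in this PR: `approx_split`, `full_split` and `read_rows` are hypothetical names, and the real loadtxt handles many more details (dtypes, usecols, blank lines, etc.).

```python
def approx_split(line, delimiter=","):
    # Cheap path: only split on the delimiter.  No comment or newline
    # handling; float() tolerates surrounding whitespace such as "\r\n",
    # so stripping is unnecessary.
    return line.split(delimiter)


def full_split(line, delimiter=",", comments="#"):
    # Slow path: strip trailing comments and newlines before splitting.
    line = line.split(comments, 1)[0].rstrip("\r\n")
    return line.split(delimiter) if line else []


def read_rows(lines, delimiter=",", comments="#"):
    rows = []
    use_approx = True  # local flag: start with the cheap splitter
    for line in lines:
        if use_approx:
            try:
                rows.append([float(f) for f in approx_split(line, delimiter)])
                continue
            except ValueError:
                # A comment (or other junk) slipped through: permanently
                # switch to the full splitter and re-parse this line.
                use_approx = False
        fields = full_split(line, delimiter, comments)
        if fields:
            rows.append([float(f) for f in fields])
    return rows


# Example: the first row takes the fast path; the comment on the second row
# raises ValueError, after which all remaining rows use the full splitter.
print(read_rows(["1, 2\n", "3, 4  # a comment\n", "5, 6\n"]))
```

Note that "failing" here only means a ValueError from the converter, never a silently mis-parsed value, which is why the fallback is safe.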
The obligatory benchmarks: