PERF: Avoid unnecessary string operations in loadtxt. #19734
Closed
This PR goes on top of #19687 (only the last commit is new), but showcases another speed benefit of special-casing numeric types in loadtxt, so I thought I may as well post it already :-)
When using _IMPLICIT_CONVERTERS, it is actually OK if strings are passed with trailing newlines (so we don't need to strip `"\r\n"`), and comments can be implicitly detected because the converters raise ValueError on them. Therefore, one can use an "approximate" line splitter, which doesn't remove trailing comments and newlines, falling back on the full line splitter if needed. This provides a 10-20% speedup in the case where there are actually no comments in the file (we could instead check the value of the `comments` kwarg, but it defaults to a non-empty value and it seems likely most users will not notice that a large speedup can be achieved by emptying it).
However, if there are actual comments in the file, then recreating the original string from the approximately split one and re-splitting it is very costly (it would incur a >2x slowdown), so we switch back to the full splitter (controlled by a local flag) in that case. Overall, only very short loads (10 rows) that include comments are slowed down by ~10% (likely due to the extra processing on the row with comments).
(To be fully explicit, despite its name, the "approximate" splitter will
never parse incorrect values; it may simply "fail", but we just fall
back to the full/slow splitter in that case.)
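To illustrate the idea, here is a minimal sketch of the fallback logic, not the actual code in this PR: `approx_split`, `full_split` and `read_rows` are hypothetical names, and the real loadtxt handles many more details (dtypes, usecols, blank lines, etc.).

```python
def approx_split(line, delimiter=","):
    # Cheap path: only split on the delimiter.  No comment or newline
    # handling; float() tolerates surrounding whitespace such as "\r\n",
    # so stripping is unnecessary.
    return line.split(delimiter)


def full_split(line, delimiter=",", comments="#"):
    # Slow path: strip trailing comments and newlines before splitting.
    line = line.split(comments, 1)[0].rstrip("\r\n")
    return line.split(delimiter) if line else []


def read_rows(lines, delimiter=",", comments="#"):
    rows = []
    use_approx = True  # local flag: start with the cheap splitter
    for line in lines:
        if use_approx:
            try:
                rows.append([float(f) for f in approx_split(line, delimiter)])
                continue
            except ValueError:
                # A comment (or other junk) slipped through: permanently
                # switch to the full splitter and re-parse this line.
                use_approx = False
        fields = full_split(line, delimiter, comments)
        if fields:
            rows.append([float(f) for f in fields])
    return rows


# Example: the first row takes the fast path; the comment on the second row
# raises ValueError, after which all remaining rows use the full splitter.
print(read_rows(["1, 2\n", "3, 4  # a comment\n", "5, 6\n"]))
```

Note that "failing" here only means a ValueError from the converter, never a silently mis-parsed value, which is why the fallback is safe.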
The obligatory benchmarks: