
PERF: In loadtxt, decide once and for all whether decoding is needed. #19609


Merged: 3 commits merged into numpy:main on Aug 11, 2021

Conversation

@anntzer (Contributor) commented Aug 3, 2021:

... and use a single decoder function instead of repeatedly checking the
input type (in _decode_line).

~5-8% speedup.

       before           after         ratio
     [b5cdbf04]       [d3f0d905]
     <main>           <loadtxtstaticdecoder>
-         361±2ms          344±1ms     0.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100000)
-         533±1μs          507±3μs     0.95  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(200)
-       281±0.3ms          267±1ms     0.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 100000)
-      35.0±0.2ms       33.2±0.1ms     0.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 10000)
-      52.5±0.5ms       49.8±0.6ms     0.95  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(20000)
-         309±7ms          293±4ms     0.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100000)
-         334±4ms          317±1ms     0.95  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(10000)
-         372±3ms        351±0.7ms     0.95  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(0)
-     5.16±0.03ms      4.88±0.02ms     0.95  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(2000)
-      29.9±0.7ms      28.3±0.05ms     0.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 10000)
-     1.46±0.02ms      1.38±0.01ms     0.94  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(1000)
-         370±5ms        348±0.9ms     0.94  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(500)
-         297±2μs          279±2μs     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 100)
-         292±3ms          275±1ms     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 100000)
-      28.2±0.3ms      26.5±0.04ms     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 10000)
-         368±1μs          345±2μs     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100)
-       294±0.4μs          275±2μs     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 100)
-        1.46±0ms      1.37±0.01ms     0.94  bench_io.LoadtxtReadUint64Integers.time_read_uint64(1000)
-         293±3μs          275±2μs     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 100)
-     14.6±0.06ms       13.6±0.1ms     0.94  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(10000)
-      14.6±0.1ms      13.7±0.06ms     0.94  bench_io.LoadtxtReadUint64Integers.time_read_uint64(10000)
-         811±8μs         756±20μs     0.93  bench_io.LoadtxtReadUint64Integers.time_read_uint64(550)
-         316±6μs          294±1μs     0.93  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100)
-         814±4μs          756±2μs     0.93  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(550)
-     8.06±0.04ms      7.41±0.04ms     0.92  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv(2)
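The idea behind the PR can be sketched as follows (a minimal illustration, not the actual numpy code; `make_decoder` is a hypothetical helper): instead of calling `_decode_line` on every line, which re-checks `isinstance(line, bytes)` each time, the decoder is chosen once from the first line's type and then applied unconditionally.

```python
def make_decoder(first_line, encoding="latin1"):
    """Return a decode function chosen once, based on the first line's type."""
    if isinstance(first_line, bytes):
        # bytes input: decode every line with the given encoding
        return lambda line: line.decode(encoding)
    # already str: identity, so no per-line isinstance check is needed
    return lambda line: line

lines = [b"1 2 3", b"4 5 6"]
decode = make_decoder(lines[0])
rows = [decode(line).split() for line in lines]
# rows == [['1', '2', '3'], ['4', '5', '6']]
```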

@anntzer force-pushed the loadtxtstaticdecoder branch from fec6db7 to 3d3ee1d (August 4, 2021 04:40)
@anntzer changed the title from "In loadtxt, decide once and for all whether decoding is needed." to "PERF: In loadtxt, decide once and for all whether decoding is needed." (Aug 4, 2021)
@anntzer force-pushed the loadtxtstaticdecoder branch from 3d3ee1d to a5232ba (August 4, 2021 11:51)
@@ -993,8 +992,7 @@ def read_data(chunk_size):
     X = []
     line_iter = itertools.chain([first_line], fh)
     line_iter = itertools.islice(line_iter, max_rows)
-    for i, line in enumerate(line_iter):
-        vals = split_line(line)
+    for i, vals in enumerate(map(split_line, map(decode, line_iter))):
@mattip (Member) commented Aug 4, 2021:
While not required by variable scoping, perhaps it would be more readable if decode were passed in as an argument to read_data. Or would that slow things down?

@anntzer (Contributor, author) replied Aug 4, 2021:
Actually, I decided to move some of my not-yet-PR'ed further improvements into this PR (second commit): I now construct an iterator (lineno_words_iter) that performs more of the preprocessing up front, and pass it to read_data. This mostly paves the way for yet another later improvement, but it also effectively turns the len(vals) == 0 check into a bool check (via filter), a ~5% win on some benchmarks, although that measurement isn't very stable.
How does that look to you?
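The pipeline described above can be sketched roughly like this (hypothetical names mirroring the comment, not the merged code): `filter(None, ...)` drops empty rows via truthiness, avoiding an explicit `len(vals) == 0` test. Note that enumerating after the filter counts only the non-empty rows.

```python
def split_line(line):
    """Chop off comments, strip, and split at whitespace."""
    line = line.split("#", 1)[0].strip()
    return line.split() if line else []

raw_lines = ["1 2", "", "# comment only", "3 4"]
# filter(None, ...) keeps only truthy (non-empty) word lists
lineno_words_iter = enumerate(filter(None, map(split_line, raw_lines)))
result = list(lineno_words_iter)
# result == [(0, ['1', '2']), (1, ['3', '4'])]
```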

@anntzer force-pushed the loadtxtstaticdecoder branch 2 times, most recently from 6c4530a to 955bee5 (August 5, 2021 16:13)
@@ -980,40 +980,33 @@ def loadtxt(fname, dtype=float, comments='#', delimiter=None,

def split_line(line):
"""Chop off comments, strip, and split at delimiter. """
A project Member commented:
Maybe worth adding a comment about expected input being already decoded?

The author (@anntzer) replied:
see type hint below

@mattip (Member) left a review:
LGTM, just one nit, but could be merged without that.

@mattip (Member) commented Aug 6, 2021:

It seems CI is acting up. Maybe close/reopen this in a few hours to see if that clears the problem.

@eric-wieser (Member) commented:
Doesn't this conflict with some of your other PRs? Do you have a preference for merge order?

@anntzer force-pushed the loadtxtstaticdecoder branch from 955bee5 to 0f649f3 (August 6, 2021 09:38)
@anntzer (Contributor, author) commented Aug 6, 2021:

I don't mind doing the rebases, feel free to merge them in any order and I'll keep popping patches from my local stack :)

Commit message:

... and use a single decoder function instead of repeatedly checking the
input type (in `_decode_line`).

~5-8% speedup.
@anntzer force-pushed the loadtxtstaticdecoder branch 2 times, most recently from f389356 to d9fe883 (August 6, 2021 17:27)
@@ -1108,7 +1101,8 @@ def read_data(chunk_size):
     # Read until we find a line with some values, and use it to determine
     # the need for decoding and estimate the number of columns.
     for first_line in fh:
-        ncols = len(usecols or split_line(first_line))
+        ncols = len(usecols
+                    or split_line(_decode_line(first_line, encoding)))
@eric-wieser (Member) commented Aug 6, 2021:
Do you think we can eliminate the use of decode_line on lines in the file entirely? Either by detecting whether the file is text right at the beginning:

# are we going to break on some janky file-like object that doesn't play ball?
needs_decode = not isinstance(fh, io.TextIOBase)

or pulling off the first line much earlier:

for first_line in fh:
    needs_decode = isinstance(first_line, bytes)
    fh = itertools.chain([first_line], fh)
    break
else:
    needs_decode = False

@anntzer (Contributor, author) replied:

This should be possible, but it makes things a bit more complicated: in the later path, where you need to pull lines out to determine the type, you would also have to handle the possibility that the generator is empty, and then emit the empty-file warning in that case. So from a readability PoV it may not be a big net plus, and of course speed-wise it shouldn't matter at all, so perhaps let's defer that to another time?

@eric-wieser (Member) replied:

The only reason I raise it is that this refactor leads to the latin1 encoding cropping up in multiple places.

> This should be possible, but makes things a bit more complicated because then you also need (in the later case where you need to pull lines out to know the type) to handle the possibility that the generator is empty, and then emit the empty-file warning in that case.

I don't think you do: you can still defer emitting that warning until you try to count the columns of the file, since you can always re-iterate an exhausted iterator.
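This works because an exhausted iterator raises no error when iterated again; it simply yields nothing, as a tiny demo shows:

```python
it = iter([])
first = next(it, None)   # pulling from an empty iterator with a sentinel
assert first is None
assert list(it) == []    # re-iterating the exhausted iterator is harmless
```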

@eric-wieser (Member) commented Aug 6, 2021:
> of course speedwise it shouldn't matter at all,

I think it may matter if your file starts with 10000 blank lines and you ask each one whether it is bytes or str.

@anntzer (Contributor, author) replied:
OK, done.

Commit message:

Mostly to help later speedups, but also slightly optimizes
`len(vals) == 0` into a bool check (in `filter`).
@anntzer force-pushed the loadtxtstaticdecoder branch from d9fe883 to 6b681d9 (August 6, 2021 17:52)
@charris charris merged commit 1b34367 into numpy:main Aug 11, 2021
@charris (Member) commented Aug 11, 2021:
Thanks @anntzer .

@anntzer anntzer deleted the loadtxtstaticdecoder branch August 11, 2021 20:29