
PERF: In loadtxt, decide once and for all whether decoding is needed. #19609


Merged: 3 commits merged into numpy:main on Aug 11, 2021

Conversation

@anntzer (Contributor) commented Aug 3, 2021:

... and use a single decoder function instead of repeatedly checking the
input type (in _decode_line).

~5-8% speedup.

       before           after         ratio
     [b5cdbf04]       [d3f0d905]
     <main>           <loadtxtstaticdecoder>
-         361±2ms          344±1ms     0.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100000)
-         533±1μs          507±3μs     0.95  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(200)
-       281±0.3ms          267±1ms     0.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 100000)
-      35.0±0.2ms       33.2±0.1ms     0.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 10000)
-      52.5±0.5ms       49.8±0.6ms     0.95  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(20000)
-         309±7ms          293±4ms     0.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100000)
-         334±4ms          317±1ms     0.95  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(10000)
-         372±3ms        351±0.7ms     0.95  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(0)
-     5.16±0.03ms      4.88±0.02ms     0.95  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(2000)
-      29.9±0.7ms      28.3±0.05ms     0.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 10000)
-     1.46±0.02ms      1.38±0.01ms     0.94  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(1000)
-         370±5ms        348±0.9ms     0.94  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(500)
-         297±2μs          279±2μs     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 100)
-         292±3ms          275±1ms     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 100000)
-      28.2±0.3ms      26.5±0.04ms     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 10000)
-         368±1μs          345±2μs     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100)
-       294±0.4μs          275±2μs     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 100)
-        1.46±0ms      1.37±0.01ms     0.94  bench_io.LoadtxtReadUint64Integers.time_read_uint64(1000)
-         293±3μs          275±2μs     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 100)
-     14.6±0.06ms       13.6±0.1ms     0.94  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(10000)
-      14.6±0.1ms      13.7±0.06ms     0.94  bench_io.LoadtxtReadUint64Integers.time_read_uint64(10000)
-         811±8μs         756±20μs     0.93  bench_io.LoadtxtReadUint64Integers.time_read_uint64(550)
-         316±6μs          294±1μs     0.93  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100)
-         814±4μs          756±2μs     0.93  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(550)
-     8.06±0.04ms      7.41±0.04ms     0.92  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv(2)
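The idea behind the PR can be sketched as follows (a minimal illustration, not the actual numpy code; `make_decoder` is a hypothetical helper): instead of calling `_decode_line` on every line, which re-checks `isinstance(line, bytes)` each time, the decoder is chosen once from the first line's type and then applied unconditionally.

```python
def make_decoder(first_line, encoding="latin1"):
    """Return a decode function chosen once, based on the first line's type."""
    if isinstance(first_line, bytes):
        # bytes input: decode every line with the given encoding
        return lambda line: line.decode(encoding)
    # already str: identity, so no per-line isinstance check is needed
    return lambda line: line

lines = [b"1 2 3", b"4 5 6"]
decode = make_decoder(lines[0])
rows = [decode(line).split() for line in lines]
# rows == [['1', '2', '3'], ['4', '5', '6']]
```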

@anntzer force-pushed the loadtxtstaticdecoder branch from fec6db7 to 3d3ee1d (August 4, 2021 04:40)
@anntzer changed the title from "In loadtxt, decide once and for all whether decoding is needed." to "PERF: In loadtxt, decide once and for all whether decoding is needed." (Aug 4, 2021)
@anntzer force-pushed the loadtxtstaticdecoder branch from 3d3ee1d to a5232ba (August 4, 2021 11:51)
@@ -993,8 +992,7 @@ def read_data(chunk_size):
     X = []
     line_iter = itertools.chain([first_line], fh)
     line_iter = itertools.islice(line_iter, max_rows)
-    for i, line in enumerate(line_iter):
-        vals = split_line(line)
+    for i, vals in enumerate(map(split_line, map(decode, line_iter))):
@mattip (Member) commented Aug 4, 2021:
While not required by variable scoping, perhaps it would be more readable if decode were passed in as an argument to read_data. Or would that slow things down?

@anntzer (Contributor, author) replied Aug 4, 2021:
Actually, I decided to move some of my not-yet-PR'ed further improvements into this PR (second commit): I now construct an iterator (lineno_words_iter) that performs more of the preprocessing up front, and pass it to read_data. This mostly paves the way for yet another later improvement, but it also effectively turns the len(vals) == 0 check into a bool check (via filter), a ~5% win on some benchmarks, although that measurement isn't very stable.
How does that look to you?
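The pipeline described above can be sketched roughly like this (hypothetical names mirroring the comment, not the merged code): `filter(None, ...)` drops empty rows via truthiness, avoiding an explicit `len(vals) == 0` test. Note that enumerating after the filter counts only the non-empty rows.

```python
def split_line(line):
    """Chop off comments, strip, and split at whitespace."""
    line = line.split("#", 1)[0].strip()
    return line.split() if line else []

raw_lines = ["1 2", "", "# comment only", "3 4"]
# filter(None, ...) keeps only truthy (non-empty) word lists
lineno_words_iter = enumerate(filter(None, map(split_line, raw_lines)))
result = list(lineno_words_iter)
# result == [(0, ['1', '2']), (1, ['3', '4'])]
```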

@anntzer force-pushed the loadtxtstaticdecoder branch 2 times, most recently from 6c4530a to 955bee5 (August 5, 2021 16:13)
@@ -980,40 +980,33 @@ def loadtxt(fname, dtype=float, comments='#', delimiter=None,

def split_line(line):
"""Chop off comments, strip, and split at delimiter. """
A project Member commented:
Maybe worth adding a comment about expected input being already decoded?

The author (@anntzer) replied:
see type hint below

@mattip (Member) left a review:
LGTM, just one nit, but could be merged without that.

@mattip (Member) commented Aug 6, 2021:

It seems CI is acting up. Maybe close/reopen this in a few hours to see if that clears the problem.

@eric-wieser (Member) commented:
Doesn't this conflict with some of your other PRs? Do you have a preference for merge order?

@anntzer force-pushed the loadtxtstaticdecoder branch from 955bee5 to 0f649f3 (August 6, 2021 09:38)
@anntzer (Contributor, author) commented Aug 6, 2021:

I don't mind doing the rebases, feel free to merge them in any order and I'll keep popping patches from my local stack :)

Commit message:

... and use a single decoder function instead of repeatedly checking the
input type (in `_decode_line`).

~5-8% speedup.
@anntzer force-pushed the loadtxtstaticdecoder branch 2 times, most recently from f389356 to d9fe883 (August 6, 2021 17:27)
@@ -1108,7 +1101,8 @@ def read_data(chunk_size):
     # Read until we find a line with some values, and use it to determine
     # the need for decoding and estimate the number of columns.
     for first_line in fh:
-        ncols = len(usecols or split_line(first_line))
+        ncols = len(usecols
+                    or split_line(_decode_line(first_line, encoding)))
@eric-wieser (Member) commented Aug 6, 2021:
Do you think we can eliminate the use of decode_line on lines in the file entirely? Either by detecting whether the file is text right at the beginning:

# are we going to break on some janky file-like object that doesn't play ball?
needs_decode = not isinstance(fh, io.TextIOBase)

or pulling off the first line much earlier:

for first_line in fh:
    needs_decode = isinstance(first_line, bytes)
    fh = itertools.chain([first_line], fh)
    break
else:
    needs_decode = False

@anntzer (Contributor, author) replied:

This should be possible, but it makes things a bit more complicated: in the later path, where you need to pull lines out to determine the type, you would also have to handle the possibility that the generator is empty, and then emit the empty-file warning in that case. So from a readability PoV it may not be a big net plus, and of course speed-wise it shouldn't matter at all, so perhaps let's defer that to another time?

@eric-wieser (Member) replied:

The only reason I raise it is that this refactor leads to the latin1 encoding cropping up in multiple places.

> This should be possible, but makes things a bit more complicated because then you also need (in the later case where you need to pull lines out to know the type) to handle the possibility that the generator is empty, and then emit the empty-file warning in that case.

I don't think you do: you can still defer emitting that warning until you try to count the columns of the file, since you can always re-iterate an exhausted iterator.
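This works because an exhausted iterator raises no error when iterated again; it simply yields nothing, as a tiny demo shows:

```python
it = iter([])
first = next(it, None)   # pulling from an empty iterator with a sentinel
assert first is None
assert list(it) == []    # re-iterating the exhausted iterator is harmless
```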

@eric-wieser (Member) commented Aug 6, 2021:
> of course speedwise it shouldn't matter at all,

I think it may matter if your file starts with 10000 blank lines and you ask each one whether it is bytes or str.

@anntzer (Contributor, author) replied:
OK, done.

Commit message:

Mostly to help later speedups, but also slightly optimizes
`len(vals) == 0` into a bool check (in `filter`).
@anntzer force-pushed the loadtxtstaticdecoder branch from d9fe883 to 6b681d9 (August 6, 2021 17:52)
@charris charris merged commit 1b34367 into numpy:main Aug 11, 2021
@charris (Member) commented Aug 11, 2021:
Thanks @anntzer .

@anntzer anntzer deleted the loadtxtstaticdecoder branch August 11, 2021 20:29