PERF: In loadtxt, decide once and for all whether decoding is needed. #19609
Conversation
Force-pushed from fec6db7 to 3d3ee1d
Force-pushed from 3d3ee1d to a5232ba
numpy/lib/npyio.py (outdated diff):
```diff
@@ -993,8 +992,7 @@ def read_data(chunk_size):
         X = []
         line_iter = itertools.chain([first_line], fh)
         line_iter = itertools.islice(line_iter, max_rows)
-        for i, line in enumerate(line_iter):
-            vals = split_line(line)
+        for i, vals in enumerate(map(split_line, map(decode, line_iter))):
```
While not required by variable scoping, perhaps it would be more readable if `decode` were passed in as an argument to `read_data`. Or would that slow things down?
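(For illustration, a minimal sketch of that suggestion; the `read_data`/`decode`/`split_line` signatures below are assumptions, not npyio's actual ones.)

```python
import itertools

# Hypothetical sketch: make the decoder an explicit argument instead of a
# name picked up from the enclosing scope.
def read_data(line_iter, decode, split_line, max_rows=None):
    for i, vals in enumerate(map(split_line,
                                 map(decode, itertools.islice(line_iter, max_rows)))):
        yield i, vals

# With already-decoded text the decoder is just the identity:
print(list(read_data(iter(["1 2", "3 4"]), decode=lambda s: s, split_line=str.split)))
# [(0, ['1', '2']), (1, ['3', '4'])]
```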
Actually, I decided to just move some of my not-yet-PR'ed further improvements into this PR (second commit): I now construct an iterator (`lineno_words_iter`) that includes more preprocessing steps outside, and pass it to `read_data`. This will mostly simplify yet another later improvement, but also slightly optimizes `len(vals) == 0` into (effectively) a bool check (in `filter`) (5% on some benchmarks, although that's not very stable).

How does that look to you?
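(A rough guess at such a pipeline, purely for illustration; only the name `lineno_words_iter` comes from the comment above, everything else is assumed.)

```python
import itertools
from operator import itemgetter

# Decode, split, number, and drop empty lines, all before handing the
# iterator to read_data.
def make_lineno_words_iter(fh, decode, split_line, max_rows=None):
    lines = map(decode, itertools.islice(fh, max_rows))
    numbered = enumerate(map(split_line, lines))
    # filter keeps only pairs whose word list is non-empty (a truthiness check).
    return filter(itemgetter(1), numbered)

lineno_words_iter = make_lineno_words_iter(
    iter(["1 2", "", "3 4"]), decode=lambda s: s, split_line=str.split)
print(list(lineno_words_iter))  # [(0, ['1', '2']), (2, ['3', '4'])]
```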
Force-pushed from 6c4530a to 955bee5
numpy/lib/npyio.py (outdated diff):
```diff
@@ -980,40 +980,33 @@ def loadtxt(fname, dtype=float, comments='#', delimiter=None,

     def split_line(line):
         """Chop off comments, strip, and split at delimiter. """
```
Maybe worth adding a comment about expected input being already decoded?
see type hint below
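(For illustration, a hedged guess at what a hinted signature plus docstring note could look like; the body below is a simplified stand-in, not the PR's actual code.)

```python
def split_line(line: str, comments="#", delimiter=None) -> list[str]:
    """Chop off comments, strip, and split at delimiter.

    `line` is expected to already be decoded text (str), never bytes.
    """
    line = line.split(comments, 1)[0].strip()
    return line.split(delimiter) if line else []

print(split_line("1 2 3  # a comment"))       # ['1', '2', '3']
print(split_line("# nothing but a comment"))  # []
```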
LGTM, just one nit, but could be merged without that.
It seems CI is acting up. Maybe close/open this in a few hours to see if that clears the problem.
Doesn't this conflict with some of your other PRs? Do you have a preference for merge order?
Force-pushed from 955bee5 to 0f649f3
I don't mind doing the rebases, feel free to merge them in any order and I'll keep popping patches from my local stack :)
... and use a single decoder function instead of repeatedly checking the input type (in `_decode_line`). ~5-8% speedup.
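(For context, a sketch contrasting a per-line type check with choosing a decoder once; both helpers below are illustrative assumptions, not the actual `_decode_line`.)

```python
# Roughly what repeated per-line decoding amounts to: every line pays a type
# check, even when the file only ever yields str.
def decode_each(lines, encoding="latin1"):
    for line in lines:
        if isinstance(line, bytes):   # re-evaluated for every single line
            line = line.decode(encoding)
        yield line

# Decide-once alternative: inspect a single line up front and bind a decoder;
# the hot loop then runs with no per-line type checks at all.
def choose_decoder(first_line, encoding="latin1"):
    if isinstance(first_line, bytes):
        return lambda line: line.decode(encoding)
    return lambda line: line

decode = choose_decoder(b"1 2 3")
print(decode(b"4 5 6"))  # '4 5 6'
```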
Force-pushed from f389356 to d9fe883
numpy/lib/npyio.py (outdated diff):
```diff
@@ -1108,7 +1101,8 @@ def read_data(chunk_size):
     # Read until we find a line with some values, and use it to determine
     # the need for decoding and estimate the number of columns.
     for first_line in fh:
-        ncols = len(usecols or split_line(first_line))
+        ncols = len(usecols
+                    or split_line(_decode_line(first_line, encoding)))
```
Do you think we can eliminate the use of `decode_line` on lines in the file entirely? Either by detecting whether the file is text right at the beginning:

```python
# are we going to break on some janky file-like object that doesn't play ball?
needs_decode = not isinstance(fh, io.TextIOBase)
```

or pulling off the first line much earlier:

```python
for first_line in fh:
    needs_decode = isinstance(first_line, bytes)
    fh = itertools.chain([first_line], fh)
    break
else:
    needs_decode = False
```
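(A small runnable demonstration of the second variant, under the assumption that the stream behaves like a plain binary file object; the file contents are made up.)

```python
import io
import itertools

# Peeking at the first line and re-chaining it loses nothing.
fh = io.BytesIO(b"1 2\n3 4\n")
for first_line in fh:
    needs_decode = isinstance(first_line, bytes)
    fh = itertools.chain([first_line], fh)  # put the peeked line back
    break
else:
    needs_decode = False

print(needs_decode)  # True for a bytes-mode stream
print(list(fh))      # [b'1 2\n', b'3 4\n'] -- the full stream is intact
```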
This should be possible, but makes things a bit more complicated because then you also need (in the later case where you need to pull lines out to know the type) to handle the possibility that the generator is empty, and then emit the empty-file warning in that case. So from a readability PoV it may not be a big net plus, and of course speedwise it shouldn't matter at all, so let's perhaps defer that to another time?
The only reason I raise it is that this refactor leads to the `latin1` encoding cropping up in multiple places.

> This should be possible, but makes things a bit more complicated because then you also need (in the later case where you need to pull lines out to know the type) to handle the possibility that the generator is empty, and then emit the empty-file warning in that case.

I don't think you do - you can still defer emitting that warning until you try to count the columns of the file; you can always re-iterate an empty iterator.
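(A tiny plain-Python illustration of that last point.)

```python
it = iter([])     # an empty (or already exhausted) iterator
for _ in it:      # first pass: the body simply never runs
    pass
for _ in it:      # a later pass is harmless; it still just yields nothing,
    pass          # so the empty-file warning can be emitted at that point
```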
> of course speedwise it shouldn't matter at all,

I think it may matter if your file starts with 10000 blank lines, and you ask each one whether it is bytes or a string.
OK, done.
Mostly to help later speedups, but also slightly optimizes `len(vals) == 0` into a bool check (in `filter`).
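(Illustration only, not the PR's code: the truthiness filter and the explicit length test select the same rows.)

```python
words_per_line = [["1", "2"], [], ["3", "4"], []]
# `filter` just performs the emptiness check as a cheap bool conversion.
assert ([w for w in words_per_line if len(w) != 0]
        == list(filter(None, words_per_line)))
```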
Force-pushed from d9fe883 to 6b681d9
Thanks @anntzer.