PERF: Rely on C-level str conversions in loadtxt for up to 2x speedup #19687
Conversation
(Force-pushed from af33ee7 to c98e34d.)
(Force-pushed from bc2e615 to df5ee9f.)

    ]
    # These converters only ever get str (not bytes) as input.
    _CONVERTER_DICT = {
        np.bool_: int,  # Implicitly converted to bool.
Is this correct? Booleans are only allowed values of 0 or 1.
The point is that we only need to cast the str to an int, and then we can let the dtype machinery do the int->np.bool_ cast (and np.bool_(42) == np.bool_(True)).
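A minimal sketch (not from this PR) of that flow: the converter only has to produce a Python int, and NumPy's dtype machinery performs the final cast when the value lands in a `np.bool_` field.

```python
import numpy as np

# Hypothetical sketch: the converter returns a plain Python int; assigning it
# into a bool field lets the dtype machinery do the int -> np.bool_ cast.
row = np.zeros((), dtype=[("flag", np.bool_)])
row["flag"] = int("1")                   # what the converter produces for "1"
print(row["flag"])                       # True
print(np.bool_(42) == np.bool_(True))    # True: any nonzero int casts to True
```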
Closes numpy#17277. Previously, if loadtxt was passed an unsized string or byte dtype, the size was set automatically from the longest entry in the first 50000 lines; longer entries appearing later were silently truncated.
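A small sketch (not from the PR) of the truncation described above; the 50000-line figure is loadtxt's internal chunk size, and the pre-fix dtype shown in the comment is what the bug report describes for older NumPy versions.

```python
import io
import numpy as np

# Hypothetical reproduction of the pre-fix behavior: with an unsized string
# dtype, the itemsize was inferred from the first 50000-line chunk only, so a
# longer entry appearing afterwards was silently truncated.
data = "\n".join(["abc"] * 50_000 + ["abcdefgh"])
arr = np.loadtxt(io.StringIO(data), dtype=str)
print(arr.dtype)  # '<U3' before the fix (last entry truncated to "abc"),
                  # '<U8' once the itemsize accounts for the whole input
```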
This is much faster (~30%) for loading actual structured dtypes (by skipping the recursive packer), somewhat faster (~5-10%) for large loads (>10000 rows, perhaps because shape inference of the final array is faster?), and much slower (nearly 2x) for very small loads (10 rows) or for reads using `dtype=object` (due to the extra copy forced by the current limitation on object views, which could be fixed separately); however, the main point is to allow further optimizations.
This patch takes advantage of the possibility of assigning a tuple of *strs* to a structured dtype with e.g. float fields, and have the strs be implicitly converted to floats by numpy at the C-level. (A Python-level fallback is kept to support e.g. hex floats.) Together with the previous commit, this provides a massive speedup (~2x on the loadtxt_dtypes_csv benchmark for 10_000+ ints or floats), but is beneficial with as little as 100 rows. Very small reads (10 rows) are still slower (nearly 2x for object), as well as reads using object dtypes (due to the extra copy), but the tradeoff seems worthwhile.
In the fast-path of loadtxt, the conversion to np.void implicitly checks the number of fields. Removing the explicit length check saves ~5% for the largest loads (100_000 rows) of numeric scalar types.
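To make the two commit messages above concrete, here is a rough sketch (field names are made up, not from the PR) of the mechanism being relied on: assigning a tuple of strs into a structured row dtype lets NumPy parse the numbers itself, and the same assignment rejects rows with the wrong number of fields.

```python
import numpy as np

# Hypothetical sketch of the fast path: one float field per column, filled by
# assigning the split text of each line; NumPy converts the strs at the C level.
row_dtype = np.dtype([("f0", float), ("f1", float), ("f2", float)])
rows = np.empty(2, dtype=row_dtype)
rows[0] = tuple("1.5 2.5 3.5".split())
rows[1] = tuple("4.0 5.0 6.0".split())
print(rows)

# The same assignment also checks the field count, standing in for an explicit
# len(words) == ncols test:
try:
    rows[0] = tuple("1 2".split())   # wrong number of fields
except ValueError as exc:
    print(exc)
```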
(Force-pushed from df5ee9f to a126896.)
Kindly bumping.
@charris Anything I can do to move this forward? Thanks! (Sorry, I'm picking on you as you already left a review comment :-))
Mainly bringing it up in case it interests you, @anntzer. But part of the reason the momentum has stalled a bit here is that @rossbar and I have been looking into pushing forward npreadtext. That does not have to stand in the way here, though. But it would give us a good speed-up and additionally allow supporting new features.
That sounds great; I guess it depends on the timescale over which you think npreadtext will make it into numpy. From a quick test npreadtext is faster than loadtxt even with the improvements here, but I am slightly worried that including a large chunk of C may take a while to review (wearing my matplotlib dev hat here), whereas the PR here may be faster to review (although it also involves some tricks). So if you think npreadtext can be merged relatively quickly (wrt. numpy's release schedule), I am fine with closing this PR and its followup; otherwise, perhaps it can still go in as a temporary stopgap improvement.
Closing, as superseded by gh-20580. I don't think it is helpful to keep this open unless the other PR gets rejected (which at this point seems a very long shot; I think it is finished and good).
This PR builds on top of #19618 (to avoid a tricky rebase) and #19042 (which this PR would have fixed more or less naturally anyways, so I may as well credit @DFEvans for the test he wrote in that PR). (#19618 has now been merged, so this is now ready for review.)
The general idea is as follows:
First, we treat each row of loadtxt's input as a single item of a structured dtype with no nested structured dtypes, with as many fields as needed. If loadtxt was given a scalar dtype, then the structured dtype is constructed by creating as many fields (each with that scalar dtype) as there are columns; if a structured dtype was requested, then we first flatten the dtype (as explained in #19623, "genfromtxt fails when a non-contiguous dtype is requested", but correctly taking offsets into account) and repeat the fields as needed. (Note that this also fixes a previous bug, whereby loading e.g. `"0 1 2 3 4\n5 6 7 8 9"` with `dtype=[("a", int), ("b", int)]` would return `[[(0, 1), (2, 3)], [(5, 6), (7, 8)]]` and silently drop the last column -- the old behavior seems clearly buggy.)
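As a rough sketch (the helper name is made up, not from the PR), the per-row dtype for a scalar request might be built like this:

```python
import numpy as np

# Hypothetical sketch of the row-dtype construction described above: for a
# scalar dtype request, one field per column, each with the requested dtype.
def make_row_dtype(scalar_dtype, ncols):
    return np.dtype([(f"f{i}", scalar_dtype) for i in range(ncols)])

print(make_row_dtype(np.float64, 4))
# dtype([('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8')])
# For a structured request like [("a", int), ("b", int)] read from 4 columns,
# the flattened fields would instead be repeated twice to cover every column.
```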
Once the whole array is read, we then `.view()` back to the actually requested dtype. This implies an extraneous copy if the requested dtype has `.hasobject = True` (which would be fixed by #8514, "ENH: Make it possible to call .view on object arrays"); I believe that that case is rare enough to be ignored for now (and a fix is possible anyways). In itself, this is much faster (~30%) for loading actual structured dtypes (by skipping the recursive packer), somewhat faster (~5-10%) for large loads (>10_000 rows, perhaps because shape inference of the final array is faster?), and much slower (nearly 2x) for very small loads (10 rows) or for reads using `dtype=object`; however, the main point is to allow the next points.
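A small sketch (not from the PR) of the view-back step for a scalar request, assuming the flat row dtype from the previous sketch:

```python
import numpy as np

# Hypothetical sketch: rows parsed into the flat structured dtype, then
# reinterpreted as the requested float64 dtype and reshaped to (nrows, ncols).
row_dtype = np.dtype([("f0", np.float64), ("f1", np.float64)])
rows = np.array([(1.0, 2.0), (3.0, 4.0)], dtype=row_dtype)
out = rows.view(np.float64).reshape(len(rows), -1)
print(out)
# [[1. 2.]
#  [3. 4.]]
```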
Then, we take advantage of the possibility of assigning a tuple of strs to a structured dtype with e.g. float fields, and have the strs be implicitly converted to floats by numpy at the C-level. (A Python-level fallback is kept to support e.g. hex floats.) Together with the previous commit, this provides a massive speedup (~2x on the loadtxt_dtypes_csv benchmark for 10_000+ ints or floats), but is beneficial with as little as 100 rows. Very small reads (10 rows) are still slower (nearly 2x for object), as well as reads using object dtypes (still due to the extra copy), but the tradeoff seems, again, worthwhile.
Finally, using structured dtypes provides a small extra advantage, in that they implicitly check the number of fields in the input, and thus allow skipping the `len(words) == ncols` check; even that is a ~5% speedup for the largest loads (100_000 rows) of numeric scalar types.

Overall, the benchmarks (compared to #19618) shown below indicate a >2x speedup for large reads of simple numeric types, and a slowdown of very small reads (~1.5x for 10 rows) or reads of object arrays (~10% for large reads, ~2x for small ones). But small reads are fast anyways and reading into object arrays is, again, likely rare.
Still, if one includes the earlier speedups to loadtxt that I posted recently, even these cases are faster than previously (see below), so I'll apply the credit of these earlier PRs towards this one :)