ENH: Move loadtxt to C for much better speed #20580
Conversation
(The PyPy test failure should be the same as in the
I admit I am not a big fan of converting working python code into 90% working C code by hand. It seems like the wrong approach for 2021. I would be interested to see what happens if we use cython to convert loadtxt (and its functions) to C. Does it achieve a similar speedup? In any case, can we leave this as an external package for now, especially given the open issues and design decisions? Perhaps this PR could target nreadtext instead of NumPy?
I will probably write more on the mailing list later. The quick answer is that I doubt it will help much, but I can try.
I doubt either will get within a factor of 2 for many use-cases, but I can experiment. Moving the existing code to Cython does not seem likely to be an improvement to me (since there is little Python interfacing code). Assuming I am not surprised by the above tries, we have to decide how important it is to us to get:
For something that is, for better or worse, probably core functionality. As much as I dislike the amount of code and would love to off-load or even ignore it, the status quo where we write in our tutorial "please use pandas" is not great. (Personally, I always felt that if loading a CSV is too slow, it just proves that I should not use CSV for that task. But in practice, I have to admit, I doubt that is the reality out in the wild.)
Fast reading of data in huge text files seems to be important, at least comparisons are popular :) Julia boasts it is faster than Pandas, Dask touts the advantages of parallelization, and so on and so forth. I don't deal with huge text files, but use of low-level languages to speed up common use cases is pretty much why NumPy and SciPy exist, so I am not opposed in principle. The only problem I see is trading off the complexity/maintainability of a low-level language vs Python. But I also note that NumPy is becoming more complex with SIMD and C++ templates; it is perhaps in the nature of things that performance drives development.
3k lines for a 50% improvement over pandas when using maximum float precision seems like a less-than-ideal ratio (and slower if one is happy with the default).
The pandas
IME one big advantage of Cython over C -- especially if the Cython is pythonic and uses cython cimports rather than C function calls -- is the breadth of the contributor base that is comfortable making PRs.
Looking at this post, it seems to me that unless one wants to do a multithreaded reader, the best-case gains are pretty limited (if one wants to compare to Julia as a reference point).
Yes, IIRC, compared to pandas this is basically at parity or up to 50% faster (maybe more for integer parsing in some cases). But I think you will find the C code to be much more maintainable and much easier to extend. Some notes:
Is Cython better? I guess opinions will differ on that. It is probably more accessible to contributors, but honestly, I am not sure I like Cython for code that is "heavy C". It is great for the interface code, but the interface code isn't the big chunk here.
Pandas has the reputation of being the best/fastest Python text reader, and a long-term goal has been to make NumPy as good as Pandas. Reaching that goal probably requires doing most of the work in C. It would be interesting to see just what makes the difference when the code is C rather than Python -- reading, parsing, writing, etc. -- but it does seem clear that all the fastest readers are written in low-level languages for the common case. Both R and Pandas do that, and Julia is probably in the low-level language camp.
What are the benchmarks for the code? Is there something that exercises the non-float dtypes?
The benchmarks @seberg added in the OP (click on the little black triangle to open the details tab) were computed using |
My 2 euro cents:
I tend to agree with this point of view. In addition, build-system-wise, C is much nicer.
The test suite changes look fairly modest, and include some improved error messages it seems.
}
/*
* `item` must be the nul-terminated string that is to be |
`item` doesn't seem to be an argument in the function below? Also, just to be clear, when you use `nul` and `NUL` in this PR, you are always referring to the ASCII character with a bit pattern of zero, rather than the `NULL` pointer to nowhere?
Yeah, `\0` (as ASCII or Unicode). Seems I forgot to clean up the comment(s). It can assume that the end of the string is a `\0`, but the string may include `\0` characters.
Just deleted this comment for now, doesn't really seem useful; all of the `to_...` functions are converters...
* `item` must be the nul-terminated string that is to be
* converted to a double.
*
* To be successful, to_double() must use *all* the characters |
This comment about `to_double()` is currently on top of `to_float()` instead? And neither of them accepts an `item` argument?
/*
* The original code had some error details, but I assume that we don't need
* it. Printing the string from which we tried to modify it should be fine.
* This should potentially be public NumPy API, although it is tricky, NumPy |
Looks like this sentence ends mid-thought?
Pretty much fully revised this doc-string, thanks.
Yeah, the main changes I am aware of are:
The other things should really be improvements, like the better errors (although sometimes changed) – could still add error chaining there, especially for user-provided converters. In any case, thanks for taking a look! I will be off mostly for the next 1.5 weeks, so probably no updates :).
I can't resist commenting here, so sorry in advance ;) A long time ago I wrote my own CSV parser. It had support for quotes; I think that should be pretty much standard. CSV parsers are just inherently complex and need to handle legacy input, since that is what CSV is largely about ("I want to read that old file from this legacy system XYZ here"). I totally agree with @seberg that it is the structure of the code rather than the language it is written in (C or Python) that keeps this complexity as low as possible and a CSV parser maintainable. Maintainability and reliability should be more important than speed anyway, since there is not so much speed to be gained; even GPUs can only accelerate by a factor of 2-3x, compare https://medium.com/rapids-ai/reading-larger-than-memory-csvs-with-rapids-and-dask-e6e27dfa6c0f Having CSV in a workflow is frustrating. It is always slow and, when dealing with floats (so probably most of the time), it is a reason for precision loss. In my workflows, I typically convert CSV to HDF or .npy first and create intermediate, binary cache files if needed to have short cycle times ("save .py, automated run (e.g. with entr), see results" should take only a few seconds). However, one nice thing about CSV is that it can be easily appended to, just by writing to the end of the file, and at this point, I'd like to advertise my pull request a little more: This would make .npy appendable in the same way CSV can be appended to: just write some more data to the end of the file, no need to update headers.
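A minimal sketch of that kind of binary caching step; the file names and the staleness check here are just assumptions for illustration:

```python
import os
import numpy as np

def load_cached_csv(csv_path, cache_path=None):
    """Load a CSV once, then reuse a binary .npy cache on later runs."""
    cache_path = cache_path or csv_path + ".npy"
    # Rebuild the cache if it is missing or older than the CSV source.
    if (not os.path.exists(cache_path)
            or os.path.getmtime(cache_path) < os.path.getmtime(csv_path)):
        data = np.loadtxt(csv_path, delimiter=",")
        np.save(cache_path, data)
        return data
    return np.load(cache_path)
```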
And I'd also like to add that most CSV parsers I've encountered (C, C++, JavaScript, Java) are simply broken and unmaintained. The one from Pandas is indeed the best I've seen so far, probably because of its large user base.
It might be nice to see which of the 60 open issues with loadtxt or genfromtxt are relevant.
Frankly, I don't think it closes a lot. It closes some performance (speed/memory) relevant ones (although some of them mention "genfromtxt" as well):
It fixes/changes gh-19785 (as documented and we wanted to clean this part up slightly more), but I can also undo that. It also implements gh-9291, but it is just a small carrot if anything. A major thing is quoting support as e.g. requested in gh-2211 (but not for genfromtxt for now obviously).
So does it fix a huge amount of issues? No, not really. It addresses the few big ones that meant that everyone is told not to use loadtxt: performance/memory use and CSV features like quoting – full features can now be achieved, such as escaping. Warren had added support for things like

EDIT: There are a couple of rewordings and a few simple tests to be added. The only bigger thing is to review documentation. (There is one possibility that specifying things like
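A hypothetical usage sketch of the quoting support mentioned above; the `quotechar` keyword name is an assumption here, since the kwarg is not exposed in this PR yet:

```python
import io
import numpy as np

# Hypothetical usage once quoting is exposed; the `quotechar` keyword name
# is an assumption, as the feature is not public in this PR yet.
csv = io.StringIO('"alpha, beta",1\n"gamma ""g""",2\n')
data = np.loadtxt(csv, delimiter=",", quotechar='"', dtype=str)
# Quoted fields may contain the delimiter, and a doubled quote ("") inside
# a quoted field is the Excel-style escape for a literal quote character.
assert data[0, 0] == "alpha, beta"
assert data[1, 0] == 'gamma "g"'
```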
The `str` values for those weird values used for longdouble are truncated by PyPy's complex `str` output, which seems fine, probably, since PyPy's `repr` does the right thing and will not truncate.
Adds some tests for the behavior of control characters, e.g. comments, delimiter and quotechar, when they have the same value. At this stage, these tests are more to frame the discussion about what the behavior should be than to test what it currently is. I personally think raising an exception is correct for most of these situations, though it's worth noting that np.loadtxt currently doesn't raise for most of these corner cases (and seems to randomly assign precedence to delimiter over comments or vice versa depending on the values).
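For concreteness, a minimal sketch of the shape such a test could take; the expectation that a ValueError is raised is the proposed behavior, not necessarily what loadtxt does today, and the `quotechar` keyword is assumed:

```python
import numpy as np
import pytest

# Proposed behavior: control characters that share the same value should be
# rejected outright rather than being given an arbitrary precedence.
# (Sketch of the expectation only; `quotechar` is an assumed keyword here.)
@pytest.mark.parametrize("kwargs", [
    dict(delimiter="#", comments="#"),    # delimiter collides with comments
    dict(delimiter='"', quotechar='"'),   # delimiter collides with quotechar
    dict(comments='"', quotechar='"'),    # comments collide with quotechar
])
def test_control_character_collisions(kwargs):
    with pytest.raises(ValueError):
        np.loadtxt(["1#2", "3#4"], **kwargs)
```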
Includes comments param, which is handled on the Python side.
nrows > chunksize.
The last of the small fixups that were queued up from previous reviews have been addressed, so this one is ready for review!
Something went wrong with CI.
We discussed this today and it is ready for merging once CI is green. There will be some follow-on PRs but this looks good so far. Thanks @WarrenWeckesser, @rossbar and @seberg.
- Floats with underscores
- Floats + hex floats.
Thanks Matti! A typo (or two) snuck into the last docs update; fixed now. The remaining failure is due to the scipy docs (general doc CI failure currently).
Yes, the main potential follow-ups will be looking into:
(Besides any potential issue fixes of course. I will probably send a brief heads-up email to the mailing list once it lands.)
Thanks @seberg, @rossbar, @WarrenWeckesser
This PR is based on @WarrenWeckesser's npreadtext, which was subsequently heavily modified here (and somewhat reduced in scope). While there is more to it, the main point is to improve the speed of parsing CSV(-like) files; the included io benchmarks change the following way:
Benchmarking results (sorted so that the ratio is a factor; 17 means 17 times faster). EDIT: Last table update Jan 14 (no meaningful changes expected anymore).
Marking as draft, but I do not expect major changes, so this can get a first round of review, I think.
There is a small list of TODOs:
- E.g. I think I want to remove passing `usecols` as a numpy array – should be mostly done. Moving `usecols` could be done (and would be nice), but we don't have quite the right stuff available, so I added the correct sanity checks instead. I agree this is a wart (a private one), so if anyone insists I can look at it. Moved `usecols` as well; nothing is quite awesome, but I think it is better.
- The issues from npreadtext still apply, and we should review some of these or make a call on them. The issue tracker there may also be a good way to open other smaller things to fix up (to reduce noise here).

As a general overview, the parser has a tokenizer that works with Unicode and supports the Python modes UCS4, UCS2, and UCS1. It parses everything into smaller UCS4 (Unicode) strings; this copies the data – in theory not always necessary, but convenient – although future optimization could be done.
After this step, we have dtype-specific converter functions (especially for integers) which directly convert the UCS4 string into the final result. If necessary, we fall back to `PyArray_Pack`. (Since we do not discover the result `dtype`, strings currently have to offload some work to Python.) In general, we read either line-by-line from Python iterables, or in chunks when reading from files.
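A small illustration of the two input paths (nothing new here, just the existing `loadtxt` call signature):

```python
import io
import numpy as np

# File-like objects are read in chunks, while arbitrary Python iterables
# are consumed line by line; both go through the same parser.
from_stream = np.loadtxt(io.StringIO("1,2\n3,4\n"), delimiter=",")
from_lines = np.loadtxt(["1,2", "3,4"], delimiter=",")
assert np.array_equal(from_stream, from_lines)
```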
The new features are currently fairly limited (e.g. I removed the custom float parser for supporting `d` instead of `e`), and it is not a fully featured CSV parser (but this can easily be achieved now!). One particularly important change is the current decision to remove hex float parsing and to limit integer parsing to actual integers (we used to fall back to floats). The custom parsers also mean that we do not support numbers with an underscore like `100_000` right now. Our current idea was that we should improve the error message a bit more, but generally advise users to use `converters=int` in this case.

Please check the npreadtext issue tracker for issues/changes: https://github.com/BIDS-numpy/npreadtext/issues

EDIT: One big additional feature (which I have not made public here, but which is as easy as adding the kwarg) is that the parser does support quoting (with Excel double-quote escaping).
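For illustration, a sketch of the suggested `converters` workaround for inputs the new parsers intentionally reject; the example data and the use of `float.fromhex` are assumptions, not part of the PR:

```python
import io
import numpy as np

# The C parsers intentionally reject underscored numbers and hex floats,
# so such columns can be handed to Python callables via `converters`.
# (Example data and converter choices are illustrative assumptions.)
txt = io.StringIO("100_000,0x1.8p1\n250_000,0x1.0p3\n")
data = np.loadtxt(txt, delimiter=",", converters={0: int, 1: float.fromhex})
assert data.tolist() == [[100000.0, 3.0], [250000.0, 8.0]]
```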