ENH: Move loadtxt to C for much better speed #20580
Conversation
(The PyPy test failure should be the same as in the
I admit I am not a big fan of converting working python code into 90% working C code by hand. It seems like the wrong approach for 2021. I would be interested to see what happens if we use cython to convert loadtxt (and its functions) to C. Does it achieve a similar speedup? In any case, can we leave this as an external package for now, especially given the open issues and design decisions? Perhaps this PR could target nreadtext instead of NumPy?
I will probably write more on the mailing list later. The quick answer is that I doubt it will help much, but I can try.
I doubt either will get within a factor of 2 for many use-cases, but I can experiment. Moving the existing code to Cython does not seem likely to be an improvement to me (since there is little Python interfacing code). Assuming I am not surprised by the above tries, we have to decide how important it is to us to get:
For something that is, for better or worse, probably core functionality. As much as I dislike the amount of code and would love to off-load or even ignore it, the status quo where we write in our tutorial "please use pandas" is not great. (Personally, I always felt that if loading a CSV is too slow, it just proves that I should not use CSV for that task. But in practice, I have to admit, I doubt that is the reality out in the wild.)
Fast reading of data in huge text files seems to be important, at least comparisons are popular :) Julia boasts it is faster than Pandas, Dask touts the advantages of parallelization, and so on and so forth. I don't deal with huge text files, but use of low-level languages to speed up common use cases is pretty much why NumPy and SciPy exist, so I am not opposed in principle. The only problem I see is trading off the complexity/maintainability of a low-level language vs Python. But I also note that NumPy is becoming more complex with SIMD and C++ templates; it is perhaps in the nature of things that performance drives development.
3k lines for a 50% improvement over pandas when using maximum float precision seems like a less-than-ideal ratio (and slower if one is happy with the default).
The pandas
IME one big advantage of Cython over C -- especially if the Cython is pythonic and uses cython cimports rather than C function calls -- is the breadth of the contributor base that is comfortable making PRs.
Looking at this post, it seems to me that unless one wants to do a multithreaded reader, the best-case gains are pretty limited (if one wants to compare to Julia as a reference point).
Yes, IIRC, compared to pandas this is basically at parity or up to 50% faster (maybe more for integer parsing in some cases). But I think you will find the C code to be much more maintainable and much easier to extend. Some notes:
Is Cython better? I guess opinions will differ on that. It is probably more accessible to contributors, but honestly, I am not sure I like Cython for code that is "heavy C". It is great for the interface code, but the interface code isn't the big chunk here.
Pandas has the reputation of being the best/fastest Python text reader, and a long-term goal has been to make NumPy as good as Pandas. Reaching that goal probably requires doing most of the work in C. It would be interesting to see just what makes the difference when the code is C rather than Python -- reading, parsing, writing, etc. -- but it does seem clear that all the fastest readers are written in low-level languages for the common case. Both R and Pandas do that, and Julia is probably in the low-level language camp.
What are the benchmarks for the code? Is there something that exercises the non-float dtypes?
The benchmarks @seberg added in the OP (click on the little black triangle to open the details tab) were computed using |
My 2 euro cents:
I tend to agree with this point of view. In addition, build-system-wise, C is much nicer.
The test suite changes look fairly modest, and include some improved error messages it seems.
}
/*
* `item` must be the nul-terminated string that is to be |
`item` doesn't seem to be an argument in the function below? Also, just to be clear, when you use `nul` and `NUL` in this PR, you are always referring to the ASCII character with a bit pattern of zero, rather than the `NULL` pointer to nowhere?
Yeah, `\0` (as ASCII or Unicode). Seems I forgot to clean up the comment(s). It can assume that the end of the string is a `\0`, but the string may include `\0` characters.
Just deleted this comment for now, doesn't really seem useful; all of the `to_...` functions are converters...
* `item` must be the nul-terminated string that is to be
* converted to a double.
*
* To be successful, to_double() must use *all* the characters |
This comment about `to_double()` is currently on top of `to_float()` instead? And neither of them accepts an `item` argument?
/*
* The original code had some error details, but I assume that we don't need
* it. Printing the string from which we tried to modify it should be fine.
* This should potentially be public NumPy API, although it is tricky, NumPy |
Looks like this sentence ends mid-thought?
Pretty much fully revised this doc-string, thanks.
Yeah, the main changes I am aware of are:
The other things should really be improvements, like the better errors (although sometimes changed) – could still add error chaining there, especially for user-provided converters. In any case, thanks for taking a look! I will be off mostly for the next 1.5 weeks, so probably no updates :).
I can't resist commenting here, so sorry in advance ;) A long time ago I wrote my own CSV parser. It had support for quotes; I think that should be pretty much standard. CSV parsers are just inherently complex and need to handle legacy input, since that is what CSV is largely about ("I want to read that old file from this legacy system XYZ here"). I totally agree with @seberg that it is the structure of the code rather than the language it is written in (C or Python) that keeps this complexity as low as possible and a CSV parser maintainable. Maintainability and reliability should be more important than speed anyway, since there is not so much speed to be gained; even GPUs can only accelerate by a factor of 2-3x, compare https://medium.com/rapids-ai/reading-larger-than-memory-csvs-with-rapids-and-dask-e6e27dfa6c0f Having CSV in a workflow is frustrating. It is always slow and, when dealing with floats (so probably most of the time), it is a reason for precision loss. In my workflows, I typically convert CSV to HDF or .npy first and create intermediate, binary cache files if needed to have short cycle times ("save .py, automated run (e.g. with entr), see results" should take only a few seconds). However, one nice thing about CSV is that it can be easily appended to, just by writing to the end of the file, and at this point, I'd like to advertise my pull request a little more: This would make .npy appendable in the same way CSV can be appended to: just write some more data to the end of the file, no need to update headers.
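A minimal sketch of that kind of binary caching step; the file names and the staleness check here are just assumptions for illustration:

```python
import os
import numpy as np

def load_cached_csv(csv_path, cache_path=None):
    """Load a CSV once, then reuse a binary .npy cache on later runs."""
    cache_path = cache_path or csv_path + ".npy"
    # Rebuild the cache if it is missing or older than the CSV source.
    if (not os.path.exists(cache_path)
            or os.path.getmtime(cache_path) < os.path.getmtime(csv_path)):
        data = np.loadtxt(csv_path, delimiter=",")
        np.save(cache_path, data)
        return data
    return np.load(cache_path)
```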
And I'd also like to add that most CSV parsers I've encountered (C, C++, JavaScript, Java) are simply broken and unmaintained. The one from Pandas is indeed the best I've seen so far, probably because of its large user base.
It might be nice to see which of the 60 open issues with loadtxt or genfromtxt are relevant.
Frankly, I don't think it closes a lot. It closes some performance (speed/memory) relevant ones (although some of them mention "genfromtxt" as well):
It fixes/changes gh-19785 (as documented and we wanted to clean this part up slightly more), but I can also undo that. It also implements gh-9291, but it is just a small carrot if anything. A major thing is quoting support as e.g. requested in gh-2211 (but not for genfromtxt for now obviously).
So does it fix a huge amount of issues? No, not really. It addresses the few big ones that meant that everyone is told not to use loadtxt: performance/memory use and CSV features like quoting – full features can now be achieved, such as escaping. Warren had added support for things like

EDIT: There are a couple of rewordings and a few simple tests to be added. The only bigger thing is to review documentation. (There is one possibility that specifying things like
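A hypothetical usage sketch of the quoting support mentioned above; the `quotechar` keyword name is an assumption here, since the kwarg is not exposed in this PR yet:

```python
import io
import numpy as np

# Hypothetical usage once quoting is exposed; the `quotechar` keyword name
# is an assumption, as the feature is not public in this PR yet.
csv = io.StringIO('"alpha, beta",1\n"gamma ""g""",2\n')
data = np.loadtxt(csv, delimiter=",", quotechar='"', dtype=str)
# Quoted fields may contain the delimiter, and a doubled quote ("") inside
# a quoted field is the Excel-style escape for a literal quote character.
assert data[0, 0] == "alpha, beta"
assert data[1, 0] == 'gamma "g"'
```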
The `str` values for those weird values used for longdouble are truncated by PyPy's complex `str` output, which seems fine, probably, since PyPy's `repr` does the right thing and will not truncate.
Adds some tests for the behavior of control characters, e.g. comments, delimiter and quotechar, when they have the same value. At this stage, these tests are more to frame the discussion about what the behavior should be than to test what it currently is. I personally think raising an exception is correct for most of these situations, though it's worth noting that np.loadtxt currently doesn't raise for most of these corner cases (and seems to randomly assign precedence to delimiter over comments or vice versa depending on the values).
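For concreteness, a minimal sketch of the shape such a test could take; the expectation that a ValueError is raised is the proposed behavior, not necessarily what loadtxt does today, and the `quotechar` keyword is assumed:

```python
import numpy as np
import pytest

# Proposed behavior: control characters that share the same value should be
# rejected outright rather than being given an arbitrary precedence.
# (Sketch of the expectation only; `quotechar` is an assumed keyword here.)
@pytest.mark.parametrize("kwargs", [
    dict(delimiter="#", comments="#"),    # delimiter collides with comments
    dict(delimiter='"', quotechar='"'),   # delimiter collides with quotechar
    dict(comments='"', quotechar='"'),    # comments collide with quotechar
])
def test_control_character_collisions(kwargs):
    with pytest.raises(ValueError):
        np.loadtxt(["1#2", "3#4"], **kwargs)
```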
Includes comments param, which is handled on the Python side.
nrows > chunksize.
The last of the small fixups that were queued up from previous reviews have been addressed, so this one is ready for review!
Something went wrong with CI.
We discussed this today and it is ready for merging once CI is green. There will be some follow-on PRs but this looks good so far. Thanks @WarrenWeckesser, @rossbar and @seberg.
- Floats with underscores
- Floats + hex floats.
Thanks Matti! A typo (or two) snuck into the last docs update; fixed now. The remaining failure is due to the scipy docs (general doc CI failure currently).
Yes, the main potential follow-ups will be looking into:
(Besides any potential issue fixes of course. I will probably send a brief heads-up email to the mailing list once it lands.)
Thanks @seberg, @rossbar, @WarrenWeckesser
This PR is based on @WarrenWeckesser's npreadtext, which was subsequently heavily modified here (and somewhat reduced in scope). While there is more to it, the main point is to improve the speed of parsing CSV(-like) files; the included io benchmarks change the following way:
Benchmarking results (sorted so that the ratio is a factor; 17 means 17 times faster). EDIT: Last table update Jan 14 (no meaningful changes expected anymore).
Marking as draft, but I do not expect major changes, so this can get a first round of review, I think.
There is a small list of TODOs:
- E.g. I think I want to remove passing `usecols` as a numpy array – should be mostly done. Moving `usecols` could be done (and would be nice), but we don't have quite the right stuff available, so I added the correct sanity checks instead. I agree this is a wart (a private one), so if anyone insists I can look at it. Moved `usecols` as well; nothing is quite awesome, but I think it is better.
- The issues from npreadtext still apply, and we should review some of these or make a call on them. The issue tracker there may also be a good way to open other smaller things to fix up (to reduce noise here).

As a general overview, the parser has a tokenizer that works with Unicode and supports the Python modes UCS4, UCS2, and UCS1. It parses everything into smaller UCS4 (Unicode) strings; this copies the data – in theory not always necessary, but convenient – although future optimization could be done.
After this step, we have dtype-specific converter functions (especially for integers) which directly convert the UCS4 string into the final result. If necessary, we fall back to `PyArray_Pack`. (Since we do not discover the result `dtype`, strings currently have to offload some work to Python.) In general, we read either line-by-line from Python iterables, or in chunks when reading from files.
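A small illustration of the two input paths (nothing new here, just the existing `loadtxt` call signature):

```python
import io
import numpy as np

# File-like objects are read in chunks, while arbitrary Python iterables
# are consumed line by line; both go through the same parser.
from_stream = np.loadtxt(io.StringIO("1,2\n3,4\n"), delimiter=",")
from_lines = np.loadtxt(["1,2", "3,4"], delimiter=",")
assert np.array_equal(from_stream, from_lines)
```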
The new features are currently fairly limited (e.g. I removed the custom float parser for supporting `d` instead of `e`), and it is not a fully featured CSV parser (but this can easily be achieved now!). One particularly important change is the current decision to remove hex float parsing and to limit integer parsing to actual integers (we used to fall back to floats). The custom parsers also mean that we do not support numbers with an underscore like `100_000` right now. Our current idea was that we should improve the error message a bit more, but generally advise users to use `converters=int` in this case.

Please check the npreadtext issue tracker for issues/changes: https://github.com/BIDS-numpy/npreadtext/issues

EDIT: One big additional feature (which I have not made public here, but which is as easy as adding the kwarg) is that the parser does support quoting (with Excel double-quote escaping).
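For illustration, a sketch of the suggested `converters` workaround for inputs the new parsers intentionally reject; the example data and the use of `float.fromhex` are assumptions, not part of the PR:

```python
import io
import numpy as np

# The C parsers intentionally reject underscored numbers and hex floats,
# so such columns can be handed to Python callables via `converters`.
# (Example data and converter choices are illustrative assumptions.)
txt = io.StringIO("100_000,0x1.8p1\n250_000,0x1.0p3\n")
data = np.loadtxt(txt, delimiter=",", converters={0: int, 1: float.fromhex})
assert data.tolist() == [[100000.0, 3.0], [250000.0, 8.0]]
```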