
ENH: Move loadtxt to C for much better speed #20580


Merged: 70 commits merged into numpy:main from add-npreadtext on Feb 8, 2022

Conversation

@seberg (Member) commented Dec 13, 2021

This PR is based on @WarrenWeckesser's npreadtext, which has been heavily modified here (and somewhat reduced in scope).

While there is more to it, the main point is to improve the speed of parsing CSV(-like) files; the included io benchmarks change as follows:

Benchmarking results

(sorted by ratio; a ratio of 17 means 17 times faster)
EDIT: Last table update Jan 14 (no meaningful changes expected anymore)

       before           after         ratio
     [1e6b72b4]       [e2d35064]
     <add-npreadtext>       <main>    
+     9.58±0.07ms          178±2ms    18.64  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100000)
+         949±5μs       17.5±0.1ms    18.42  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 10000)
+      8.30±0.1ms        142±0.1ms    17.16  bench_io.LoadtxtCSVStructured.time_loadtxt_csv_struct_dtype
+      9.11±0.1ms        131±0.2ms    14.40  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 100000)
+        918±10μs       13.1±0.1ms    14.24  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 10000)
+     13.4±0.08μs          188±1μs    14.05  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100)
+     13.1±0.05μs          143±1μs    10.91  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 100)
+      15.6±0.4ms          135±1ms     8.67  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100000)
+      14.6±0.1ms          127±1ms     8.67  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 100000)
+      14.5±0.1ms          125±1ms     8.58  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(100000)
+     1.45±0.01ms      12.3±0.09ms     8.49  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(10000)
+     15.1±0.07ms        126±0.7ms     8.40  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 100000)
+     1.48±0.02ms      12.4±0.03ms     8.39  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 10000)
+     1.60±0.01ms       13.4±0.1ms     8.38  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 10000)
+        1.52±0ms      12.5±0.05ms     8.23  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 10000)
+     19.1±0.03μs        146±0.2μs     7.63  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100)
+      18.4±0.2μs        135±0.9μs     7.33  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(100)
+     18.7±0.08μs        137±0.5μs     7.30  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 100)
+     18.9±0.06μs          137±2μs     7.21  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 100)
+     1.55±0.01ms      10.3±0.05ms     6.66  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 10000)
+     16.0±0.08ms          105±1ms     6.60  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 100000)
+        916±10μs      5.98±0.02ms     6.52  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv([1, 3])
+     1.15±0.02ms      7.36±0.01ms     6.40  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv([1, 3, 5, 7])
+      18.2±0.3μs          110±1μs     6.03  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 100)
+        798±10μs      4.80±0.01ms     6.02  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv(2)
+     5.07±0.01μs       29.7±0.7μs     5.86  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 10)
+     7.36±0.08ms       39.7±0.2ms     5.39  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(20000)
+      77.1±0.5μs          413±3μs     5.35  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(200)
+         737±7μs      3.92±0.01ms     5.32  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(2000)
+     5.06±0.02μs       25.1±0.5μs     4.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 10)
+     12.3±0.02μs       58.3±0.5μs     4.74  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(20)
+     5.67±0.03μs       26.3±0.4μs     4.64  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 10)
+         220±4μs          989±4μs     4.49  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(1000)
+         221±5μs          987±8μs     4.46  bench_io.LoadtxtReadUint64Integers.time_read_uint64(1000)
+         123±3μs          549±2μs     4.46  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(550)
+         124±1μs          548±2μs     4.41  bench_io.LoadtxtReadUint64Integers.time_read_uint64(550)
+     2.22±0.03ms      9.76±0.04ms     4.41  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(10000)
+     2.22±0.03ms      9.69±0.04ms     4.37  bench_io.LoadtxtReadUint64Integers.time_read_uint64(10000)
+     5.62±0.01μs       24.2±0.8μs     4.31  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 10)
+     5.64±0.01μs       24.0±0.1μs     4.26  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 10)
+     5.44±0.03μs       22.8±0.3μs     4.20  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(10)
+     5.90±0.02μs       21.8±0.6μs     3.69  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 10)
+     4.22±0.01ms      12.7±0.08ms     3.02  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('str', 10000)
+      46.6±0.3μs        137±0.7μs     2.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('str', 100)
+      42.6±0.1ms        122±0.7ms     2.87  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('str', 100000)
+      73.1±0.7ms        202±0.4ms     2.76  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(0)
+      72.6±0.7ms        200±0.7ms     2.76  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(500)
+      66.1±0.9ms        182±0.6ms     2.76  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(10000)
+      10.2±0.2μs       24.3±0.1μs     2.39  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('str', 10)

Marking as draft, but I do not expect major changes, so this can get a first round of review, I think.

There is a small list of TODOs:

  • The additional tests from npreadtext are currently missing here (they should probably be rewritten in terms of loadtxt).
  • Additional tests should be added (see also the npreadtext issue tracker; one reason for the PR is that it makes checking code coverage more convenient).
  • Smaller changes (e.g. I think I want to remove passing usecols as a NumPy array). Mostly done: moving the usecols handling would be nice, but we don't quite have the right machinery available, so the correct sanity checks were added instead. I agree this is a (private) wart, so if anyone insists I can look at it. Update: usecols was moved as well; nothing about it is quite awesome, but I think it is better now.

The issues from npreadtext still apply, and we should review some of them or make a call on them. The issue tracker there may also be a good place to open other smaller fix-ups (to reduce noise here).


As a general overview, the parser has a tokenizer that works with unicode and supports Python's UCS4, UCS2, and UCS1 string layouts. It parses every field into a small UCS4 (unicode) string; this copies the data – in theory not always necessary, but convenient – and leaves room for future optimization.
After this step, dtype-specific converter functions (especially for integers) convert the UCS4 string directly into the final result. If necessary, we fall back to PyArray_Pack.
(Since we do not discover the result dtype, strings currently have to offload some work to Python.)

In general, we read either line-by-line from Python iterables, or in chunks if reading from files.
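
A rough Python sketch of that flow, purely for illustration (the real implementation is in C, and these helper names are made up):

def tokenize(line, delimiter):
    # stand-in for the C tokenizer, which copies each field into a small UCS4 buffer
    return line.rstrip("\r\n").split(delimiter)

def parse_rows(lines, delimiter, converters, fallback=str):
    rows = []
    for line in lines:                       # line-by-line for iterables, chunks for files
        fields = tokenize(line, delimiter)
        row = []
        for field, convert in zip(fields, converters):
            try:
                row.append(convert(field))   # dtype-specific fast path (e.g. int, float)
            except ValueError:
                row.append(fallback(field))  # generic fallback, akin to PyArray_Pack
        rows.append(row)
    return rows

parse_rows(["1,2.5", "3,4.5"], ",", [int, float])   # [[1, 2.5], [3, 4.5]]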

The new features are currently fairly limited (e.g. I removed the custom float parser that supported using d instead of e as the exponent character), and the parser is not fully CSV-featured (but that can easily be achieved now!).
One particularly important change is the current decision to remove hex-float parsing and to limit integer parsing to actual integers (we used to fall back to floats). The custom parsers also mean that we do not support numbers with underscores, such as 100_000, right now.
Our current idea is to improve the error message a bit more, but generally advise users to use converters=int in this case.
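
For illustration, a minimal sketch of that workaround (assuming a small two-column file; the column-dict form of converters works with both the old and the new implementation):

import numpy as np
from io import StringIO

data = StringIO("100_000,2\n300_000,4")
# the new parser rejects underscored integers, so route the affected
# column through Python's int(), which does accept them:
arr = np.loadtxt(data, delimiter=",", dtype=np.int64, converters={0: int})
# arr -> [[100000, 2], [300000, 4]]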

Please check the npreadtext issue tracker for issues/changes: https://github.com/BIDS-numpy/npreadtext/issues

EDIT: One big additional feature (which I have not made public here, but which is as easy as adding the kwarg) is that the parser does support quoting (with Excel-style double-quote escaping).
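
For reference, this is the double-quote escaping convention meant here, as implemented by Python's own csv module (the corresponding loadtxt keyword is not exposed by this PR, so no loadtxt call is shown):

import csv
from io import StringIO

data = StringIO('1,"fields, with, commas"\n2,"a ""quoted"" word"')
list(csv.reader(data))
# -> [['1', 'fields, with, commas'], ['2', 'a "quoted" word']]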

@seberg (Member, Author) commented Dec 13, 2021

(The PyPy test failure should be the same as in the empty/intp sequence parsing PR: PyPy had a bug with formatting %(100)R; it has been fixed, and for now we should just ignore the failure.)

@mattip (Member) commented Dec 14, 2021

I admit I am not a big fan of converting working python code into 90% working C code by hand. It seems like the wrong approach for 2021.

I would be interested to see what happens if we use cython to convert loadtxt (and its functions) to C. Does it achieve a similar speedup?

In any case, can we leave this as an external package for now, especially given the open issues and design decisions? Perhaps this PR could target npreadtext instead of NumPy?

@seberg (Member, Author) commented Dec 14, 2021

I will probably write more on the mailing list later. The quick answer is that I doubt it will help much, but I can try quickly. I currently see two alternatives, but maybe I am lacking vision:

  • Use Python's csv module with fromiter.
  • Use Cython to convert the existing code largely as-is.

I doubt either will get within a factor of 2 for many use cases, but I can experiment. Moving the existing code to Cython does not seem likely to be an improvement to me (since there is little Python-interfacing code here).

Assuming I am not surprised by the above tries, we have to decide how important it is to us to get:

  • A parser that is not at least 2× slower than pandas (for many use cases)
  • Access to CSV features (e.g. quoting)

for something that is, for better or worse, probably core functionality. As much as I dislike the amount of code and would love to offload or even ignore it, the status quo where we write "please use pandas" in our tutorial is not great. (Personally, I always felt that if loading a CSV is too slow, that just proves I should not use CSV for that task. But in practice, I have to admit, I doubt that is the reality out in the wild.)

@mattip added the "triage review" label (Issue/PR to be discussed at the next triage meeting) on Dec 14, 2021
@charris (Member) commented Dec 14, 2021

Fast reading of data in huge text files seems to be important, at least comparisons are popular :) Julia boasts it is faster than Pandas, Dask touts the advantages of parallelization, and so on and so forth. I don't deal with huge text files, but the use of low-level languages to speed up common use cases is pretty much why NumPy and SciPy exist, so I am not opposed in principle. The only problem I see is trading off the complexity/maintainability of a low-level language vs. Python. But I also note that NumPy is becoming more complex with SIMD and C++ templates; it is perhaps in the nature of things that performance drives development.

@bashtage (Contributor) commented Dec 14, 2021

3k lines for a 50% improvement over pandas when using maximum float precision seems like a less-than-ideal ratio (and slower if one is happy with the default).

>>> c = pd.read_csv("test.csv", delimiter=",")
96.2 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> d = pd.read_csv("test.csv", delimiter=",", float_precision="round_trip")
322 ms ± 7.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> e = _loadtxt("test.csv", delimiter=",")
215 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
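
(For context, a test file of roughly that kind can be generated with the snippet below; the shape is a guess, not necessarily what was used for the timings above.)

import numpy as np

np.savetxt("test.csv", np.random.rand(1_000_000, 5), delimiter=",")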

The pandas read_csv has a really large user base and so benefits accordingly in terms of edge cases.

@bashtage (Contributor):

I would be interested to see what happens if we use cython to convert loadtxt (and its functions) to C. Does it achieve a similar speedup?

IME one big advantage of Cython over C -- especially if the Cython is pythonic and uses Cython cimports rather than C function calls -- is the breadth of the contributor base that is comfortable making PRs.

@bashtage (Contributor):

Looking at this post, it seems to me that unless one wants to do a multithreaded reader, the best-case gains are pretty limited (if one uses Julia as a reference point).

@seberg (Member, Author) commented Dec 14, 2021

3k lines for a 50% improvement over pandas when using maximum float precision seems like a less-than-ideal ratio (and slower if one is happy with the default).

Yes, IIRC, compared to pandas this is basically at parity or up to 50% faster (maybe more for integer parsing in some cases). But I think you will find this C code to be much more maintainable and much easier to extend. Some notes:

  • To get feature parity with loadtxt, pandas must use its Python backend, which is also extremely complex (and very slow, of course). This is needed for things like:

    • Reading any file that isn't a raw utf-8 file.
    • Using any control character that is not ASCII. E.g. splitting by whitespace (delimiter=None) doesn't work; pandas only supports ASCII whitespace at any decent speed (even with delimiter=',', non-ASCII whitespace will not get stripped while ASCII whitespace does get stripped). See the small check after this list.

    Here, on the contrary, the reading itself still largely uses the Python API to keep things simple and flexible (with the line-by-line vs. chunked mode adding a bit of complexity for a ~30% speedup, I admit), meaning that all of those cases are covered by the same code.

  • I am very sure that you will have many more problems extending the pandas code. I wrote this in a way that should make it straightforward to e.g. allow users to supply parsers written in C directly, or to swap out our parsers (if you want that fast float parser). I like the idea of making it so that users can supply converters=faster_float_parser_written_in_C. (Yes, we would have to nail down a public API for that, but the general design is there.)

  • The pandas code is also much less maintainable, IMO: it has as many lines of code, but is not nearly as well structured.
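
A minimal check of the whitespace point above, assuming the default delimiter=None, which splits on any unicode whitespace (\u00a0 is a non-breaking space):

import numpy as np
from io import StringIO

data = StringIO("1\u00a02\n3\u00a04")
np.loadtxt(data)
# expected: array([[1., 2.], [3., 4.]])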

Is Cython better? I guess opinions will differ on that. It is probably more accessible to contributions, but honestly, I am not sure I like Cython for code that is "heavy C". It is great for the interface code, but the interface code isn't the big chunk here.

@charris (Member) commented Dec 14, 2021

3k lines for a 50% improvement over pandas when using maximum float precision seems like a less-than-ideal ratio

Pandas has the reputation of being the best/fastest Python text reader, and a long-term goal has been to make NumPy as good as Pandas. Reaching that goal probably requires doing most of the work in C. It would be interesting to see just what makes the difference when the code is C rather than Python -- reading, parsing, writing, etc. -- but it does seem clear that all the fastest readers are written in low-level languages for the common case. Both R and Pandas do that, and Julia is probably in the low-level language camp.

@mattip (Member) commented Dec 14, 2021

What are the benchmarks for the code? Is there something that exercises the non-float dtypes?

@rossbar (Contributor) commented Dec 14, 2021

What are the benchmarks for the code? Is there something that exercises the non-float dtypes?

The benchmarks @seberg added in the OP were computed using bench_io from the benchmark suite. You should be able to test/reproduce locally with something like `python runtests.py --bench-compare main add-npreadtext bench_io`.

@rgommers (Member):

My 2 euro cents:

  • We have a ton of open loadtxt/genfromtxt issues, and this has been a mess for years. That plus the slowness justifies a rewrite.
  • I wouldn't have had a very clear preference between adopting the Pandas reader (well tested by now) or rewriting from scratch (cleaner, somewhat faster code) without spending quite a bit of time looking into the problem. I'd lean towards the Pandas one, mostly to avoid spending a lot of developer time. That said, now that the work is mostly done, I'm happy to trust @seberg & co that this PR is the right choice.
  • It looks like a low-stakes decision to merge this PR once it's complete. It's then faster and better structured than the Pandas reader, and benchmarks look great. So if the tests pass (and ideally it fixes a bunch of bugs as well), then +1 for merging. In the (quite unlikely) case it turns out to be horribly buggy, revert it in 1.23.1 and try again or reconsider.

Is Cython better? I guess opinions will differ on that. It is probably more accessible to contributions, but honestly, I am not sure I like Cython for code that is "heavy C". It is great for the interface code, but the interface code isn't the big chunk here.

I tend to agree with this point of view. In addition, build-system-wise, C is much nicer.

@tylerjereddy (Contributor) left a comment:

The test suite changes look fairly modest, and include some improved error messages it seems.

}

/*
* `item` must be the nul-terminated string that is to be
Review comment (Contributor):

item doesn't seem to be an argument in the function below?

Also, just to be clear, when you use nul and NUL in this PR, you are always referring to the ASCII character with a bit pattern of zero, rather than the NULL pointer to nowhere?

@seberg (Member, Author) replied:

Yeah, \0 (as ASCII or unicode). Seems I forgot to clean up the comment(s). It can assume that the end of the string is a \0, but the string may also include embedded \0 characters.

@seberg (Member, Author) replied:

Just deleted this comment for now; it doesn't really seem useful, since all of the to_... functions are converters...

* `item` must be the nul-terminated string that is to be
* converted to a double.
*
* To be successful, to_double() must use *all* the characters
Review comment (Contributor):

This comment about to_double() currently sits on top of to_float() instead? And neither of them accepts an item argument?

/*
* The original code had some error details, but I assume that we don't need
* it. Printing the string from which we tried to modify it should be fine.
* This should potentially be public NumPy API, although it is tricky, NumPy
Review comment (Contributor):

Looks like this sentence ends mid-thought?

@seberg (Member, Author) replied:

Pretty much fully revised this doc-string, thanks.

@seberg (Member, Author) commented Dec 24, 2021

The test suite changes look fairly modest, and include some improved error messages it seems.

Yeah, the main changes I am aware of are:

  • More restrictive parsing: no floats when storing integers, and no support for underscores as in 100_000 (this can be gotten back using converters=float, although that is a bit annoying if you try to support both versions, since the old one requires listing all columns; see the sketch after this list).
  • There is some strange stuff about counting lines vs. counting rows: "loadtxt strange handling of empty lines with max_rows" #19785 (this can be fixed in pretty much any way, but needs a decision). Also, if you pass in rows that contain newlines, things get a bit weird. I did not care much, because it seems like a super strange use case, but it is a change.
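
A sketch of the converters difference from the first bullet above (assuming a small two-column file):

import numpy as np
from io import StringIO

text = "1_000 2_000\n3_000 4_000"

# new implementation: a single callable can cover every column
arr = np.loadtxt(StringIO(text), converters=float)

# old implementation: the same effect requires listing every column
arr = np.loadtxt(StringIO(text), converters={0: float, 1: float})
# either way: [[1000., 2000.], [3000., 4000.]]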

The other things should really be improvements, like the better errors (although the messages sometimes changed) – we could still add error chaining there, especially for user-provided converters.

In any case, thanks for taking a look! I will be off mostly for the next 1.5 weeks, so probably no updates :).

@seberg force-pushed the add-npreadtext branch 3 times, most recently from 756e274 to 1f25dfe on January 8, 2022
@xor2k (Contributor) commented Jan 9, 2022

I can't resist commenting here, so sorry in advance ;)

A long time ago I wrote my own CSV parser. It had support for quotes; I think that should be pretty much standard. CSV parsers are just inherently complex and need to handle legacy formats, since that is what CSV is largely about ("I want to read that old file from this legacy system XYZ"). I totally agree with @seberg that it is the structure of the code, rather than the language it is written in (C or Python), that keeps this complexity as low as possible and a CSV parser maintainable. Maintainability and reliability should be more important than speed anyway, since there is not that much speed to be gained – even GPUs can only accelerate parsing by a factor of 2-3x, compare

https://medium.com/rapids-ai/reading-larger-than-memory-csvs-with-rapids-and-dask-e6e27dfa6c0f

Having CSV in a workflow is frustrating. It is always slow, and when dealing with floats (so probably most of the time) it is a source of precision loss. In my workflows, I typically convert CSV to HDF or .npy first and create intermediate binary cache files if needed, to keep cycle times short ("save the .py, automated run (e.g. with entr), see results" should take only a few seconds). However, one nice thing about CSV is that it can easily be appended to, just by writing more data to the end of the file – and at this point, I'd like to advertise my pull request a little more:

#20321

This would make .npy appendable in the same way CSV can be appended to: just write some more data to the end of the file, no need to update headers.

@xor2k (Contributor) commented Jan 9, 2022

And I'd also like to add that most CSV parsers I've encountered (C, C++, JavaScript, Java) are simply broken and unmaintained. The one from Pandas is indeed the best I've seen so far, probably because of its large user base.

@mattip (Member) commented Jan 11, 2022

We have a ton of open loadtxt/genfromtxt issues

It might be nice to see which of the 60 open issues mentioning loadtxt or genfromtxt are relevant.

@seberg (Member, Author) commented Jan 11, 2022

Frankly, I don't think it closes a lot. It closes some performance (speed/memory) relevant ones (although some of them mention "genfromtxt" as well):

It fixes/changes gh-19785 (as documented; we wanted to clean this part up slightly more), but I can also undo that. It also implements gh-9291, but that is just a small carrot if anything. A major thing is quoting support, as e.g. requested in gh-2211 (but not for genfromtxt for now, obviously).
I think there are a couple of others that may just not be listed anywhere, like:

  • Using float for int conversions is clearly incorrect for some values.
  • Better error messages

So does it fix a huge number of issues? No, not really. It addresses the few big ones that meant everyone was told not to use loadtxt: performance/memory use and CSV features like quoting – full CSV featurefulness, such as escaping, can now be achieved.
For most issues, moving the code to C doesn't really change how hard or easy it would be to address them.

Warren had added support for things like decimal=","; I ripped it out because that chunk relied on vendoring 3000 lines of float-parsing code, and I prefer to start minimal. But it can be re-added without vendoring, by handling it in a translation step (which really should not make a big difference compared to the cost of float parsing itself).
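
As an illustrative workaround (not the removed decimal="," feature, and assuming the converters receive decoded strings), decimal commas can already be handled today with exactly such a translation step in a converter:

import numpy as np
from io import StringIO

dec = lambda s: float(s.replace(",", "."))
data = StringIO("1,5;2,25\n3,75;4,0")
arr = np.loadtxt(data, delimiter=";", converters={0: dec, 1: dec})
# arr -> [[1.5, 2.25], [3.75, 4.0]]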

EDIT: There are a couple of rewordings and a few simple tests still to be added. The only bigger thing is to review the documentation. (There is a possibility that specifying things like quotechar="#", delimiter="#" can misbehave in some cases and should error explicitly, although I somewhat doubt it.)

@seberg force-pushed the add-npreadtext branch 4 times, most recently from a60b95c to ca2d6ee on January 13, 2022
@seberg removed the "triage review" label on Jan 14, 2022
The `str` values for those weird values used for longdouble are
truncated by PyPy's complex `str` output.  Which seems fine probably
since PyPy's `repr` does the right thing and will not truncate.
Adds some tests for the behavior of control characters, e.g. comments, delimiter and quotechar, when they have the same value. At this stage, these tests are more to frame the discussion about what the behavior should be, not to test what it currently is. I personally think raising an exception is correct for most of these situations, though it's worth noting that np.loadtxt currently doesn't for most of these corner cases (and seems to randomly assign precedence to delimiter over comments or vice versa depending on the values).
Includes comments param, which is handled on the Python side.
@rossbar (Contributor) commented Jan 28, 2022

The last of the small fixups that were queued up from previous reviews have been addressed, so this one is ready for review!

@mattip (Member) commented Feb 7, 2022

Something went wrong with CI.

@mattip (Member) left a comment:

We discussed this today and it is ready for merging once CI is green. There will be some follow-on PRs but this looks good so far. Thanks @WarrenWeckesser, @rossbar and @seberg.

 - Floats with underscores
 - Floats + hex floats.
@seberg (Member, Author) commented Feb 7, 2022

Thanks Matti! A typo (or two) snuck into the last docs update; fixed now. The remaining failure is due to the SciPy docs (a general doc CI failure currently).

There will be some follow-on PRs but this looks good so far.

Yes, the main potential follow-ups will be looking into:

  • removing that Python 2 compatibility encoding="bytes" shim.
  • renaming skip_rows to skip_lines to make it clearer.

(Besides any potential issue fixes of course. I will probably send a brief heads-up email to the mailing list once it lands.)

@mattip merged commit 4b3ddd7 into numpy:main on Feb 8, 2022
@mattip (Member) commented Feb 8, 2022

Thanks @seberg, @rossbar, @WarrenWeckesser

Merging this pull request may close: BUG: lib: Incorrect result from loadtxt with a complicated dtype