ENH: auto-type detect data loader #143
Conversation
Mmh, would you mind stressing the differences w/ genfromtxt? Is it really
I had the same first reaction as Pierre. The new functionality looks useful, but would it be possible to infer the dtypes per column and then feed that to loadtxt/genfromtxt?
The main advantages of this function over genfromtxt, I think, are
All that said, it's entirely possible that the innards which determine the type should be ripped out and submitted as a function on their own. I also just pushed some changes that added a bit of documentation, made a few performance enhancements, and fixed a small bug.
I do like this code better than genfromtxt, although I'm still worried somewhat about duplication. Did you send a message about this to the mailing list? There are a couple of people making regular improvements to loadtxt and genfromtxt who will be interested. A comment on how the additions are organized:
I do like the large number of test cases, nice job on that!
I'll email the general list tonight. Thanks for reminding me.

I actually think I need more test cases. Currently my tests only tell you if it breaks, without giving much indication of where or how. And it's a big enough method that that's a problem. You mean put all the expected outputs inside the test file itself?

Placing it into io was Mark's idea. He was hoping it would be a stepping stone to later breaking up all of npyio.py into separate method files instead of one huge file. I'd be happy to use a different file name, but I think Mark has the right idea of breaking up that huge file and putting its methods in a folder.

Based on some small performance testing I did, the loadtable function's big bottlenecks are (1) checking for sizes, if that's enabled, and (2) reading in the data. The first can be disabled by the user, while the second can be replaced with a more specialized function or a C extension later on. One of the joys of explicit modularity.

I'd also welcome a name different from loadtable. I don't really like the name, but didn't know what else to call it. I was modeling parts of its expected behavior and API on R's read.table function, hence the name.

Code duplication definitely seems like a potential problem. loadtable isn't well developed enough that it could completely take the place of genfromtxt, but I think there is a potential future where it could. Until then I'm just hoping this is simpler to use and fulfills a niche that genfromtxt and its wrappers don't.
Or in a separate file. I just see a lot of 3-line long .py files, which feels a bit too Matlab-like :) Some more modularity is good, it just shouldn't be called
I also find it much clearer to have the (StringIO) input and comparison data together in one test file.
- enable Fortran-style exponentials (3.1415D+00 ;-) - OK, probably easy enough to contribute in that framework
- allow record names to be passed by the user - somewhat conflicting with the auto-detection, I admit, but I don't know of a way of changing the field names once the record array has been created.
I'd been told about the asciitable project, and looked at it some. But it ultimately didn't seem relevant to the things I was thinking about at the time, so I didn't look into it further. (Since a lot of the stuff I was worrying about was recognizing types, constructing performant regular expressions--ha!--and getting type promotion rules right.) If anyone has specific suggestions for things I should look at in it, I'd happily give it a second go. As long as we're listing wants, two other options which I'd like this to have (in the 'and I wanna pony' sense) are:
It was just a vague idea for the moment - as asciitable also is already relatively mature, and other packages like ATPy depend on it, it would not be easily replaced by a new module. One idea for producing some synergy in the mid-range is that maybe the BaseReader of asciitable could be replaced by a more performant engine based on your work. But perhaps this is rather up to the asciitable developers to think about. I'd also rank your first wish above rather high: to provide an equivalent to the 'usecols' option of loadtxt/genfromtxt.
A note on an earlier comment. Numpy says it will be deprecated in the future, but you can change the names in a record array: if x is a record array with 2 fields, x.dtype.names = ('new_name1', 'new_name2')
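For reference, a minimal sketch of the rename trick described above. The data, field names, and variable names here are placeholders, not anything from the patch:

```python
import numpy as np

# Build a small two-field record array (contents are illustrative).
x = np.array([(1, 2.5), (3, 4.5)],
             dtype=[('f0', 'i8'), ('f1', 'f8')]).view(np.recarray)

# Rename the fields in place by assigning to dtype.names, as noted above.
x.dtype.names = ('new_name1', 'new_name2')

print(x.dtype.names)   # ('new_name1', 'new_name2')
print(x.new_name1)     # the first column under its new name
```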
I refactored the tests so it's all in one test file. I also changed the default delimiter from ',' to ' ', as suggested on the mailing list.
Created a function to auto-detect data types for each column in a csv-style data file. Works with arbitrary separators between the data and loads the data into a numpy record array.
I added the ability to select only certain columns.
Would it overcomplicate things to try and automatically detect a delimiter by default, like some spreadsheet apps do? Maybe based on some heuristic regexes on the first line of data?
I'm not sure what the best heuristics would be. The most common delimiters, according to wikipedia, are whitespace (either tab or space), commas, colons, and pipes. So perhaps just doing a split on each of those and going with the one with the longest list would be appropriate. But on the mailing list some people indicated they wanted whitespace as the default delimiter. And I kinda think the user should specify if they want some wonky heuristic used to determine the delimiter. Another route would just be writing a few delimiter specific wrappers. Like a loadcsvtable, except with a non-ugly name. I'm open to suggestions, as I can't think of a way to do it that really feels 'right'.
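The "split on each candidate and keep the longest list" heuristic described above can be sketched in a few lines. The candidate list and function name are illustrative, not part of any numpy API:

```python
# Candidate delimiters from the comment above: tab, comma, colon, pipe, space.
CANDIDATES = ['\t', ',', ':', '|', ' ']

def guess_delimiter(line, candidates=CANDIDATES):
    """Return the candidate that splits `line` into the most fields,
    or None if no candidate splits it at all."""
    best, best_count = None, 1
    for delim in candidates:
        count = len(line.split(delim))
        if count > best_count:
            best, best_count = delim, count
    return best

print(guess_delimiter("1,2,3,4"))   # ','
print(guess_delimiter("a b c"))     # ' '
print(guess_delimiter("abc"))       # None
```

As the discussion notes, this breaks down on quoted fields that contain the delimiter, which is what motivates the regex-based approach below in the thread.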
It would probably conflict with comma_decimals, right? Though that option is off by default.
Yes. That's part of why I'm not satisfied with that heuristic. Though it could be modified easily enough by automatically converting anything between quotes into some placeholder, such as 0. (I already do this at one point in the code.)
In some simple cases it seems obvious:
It's unlikely here that the first example should produce the strings "1", "2", "3" and the number 4, and the second example should produce the string "1,2,3,4". Things get more complicated when you have quoted strings, though, where a simple Python split() would not separate at the right place anyway:
which I would expect to produce the string "A, B, C", the number 3.1, and the string "(0 1), (2 3)". One approach for the heuristic would be to define an item regex (a quoted string including the delimiter, or a run of characters all excluding the delimiter), then matching against (item)+ to see how well the delimiter works. Probably more details would need to be worked out, though.
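The item-regex idea above can be sketched as follows. An "item" is either a quoted string (which may contain the delimiter) or a run of characters excluding it; a delimiter "works" if the whole line parses as items separated by that delimiter. The example line is a reconstruction of the one discussed in the comment, included only for illustration:

```python
import re

def count_fields(line, delim):
    """Return the number of fields if `delim` parses the whole line
    as (item)(delim item)*, else 0."""
    d = re.escape(delim)
    # An item is a quoted string, or one or more non-quote, non-delimiter chars.
    item = r'"[^"]*"|[^"%s]+' % d
    whole = r'(?:%s)(?:%s(?:%s))*' % (item, d, item)
    if re.fullmatch(whole, line) is None:
        return 0          # this delimiter does not fit the line
    return len(re.findall(item, line))

line = '"A, B, C", 3.1, "(0 1), (2 3)"'
print(count_fields(line, ', '))   # 3: commas inside quotes are not split
print(count_fields(line, ','))    # 0: ',' cannot parse the whole line
```

A heuristic could then score each candidate delimiter by `count_fields` and pick the best, which handles the quoted-string case that a plain `split()` gets wrong.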
I stuck np.lib._datasource in, just as in genfromtxt. It appears to be working, but I don't have a good idea how to test that. I'd appreciate suggestions.
If you would like to look through _datasource with an eye to simplifying it, I wouldn't complain ;) It seems a bit strange that a module that advertises itself as a generic file opener for the researcher looks like a private module.
Yes, I was confused by the _datasource code as well. In the end, I just used the same call that genfromtxt does and verified that it passed all my tests.
What is the status of this request? Are folks happy with the functionality and the addition of a new function?
Parameters
----------
f : file
Is this a 'file' or 'file object'?
The code must work with Python 2.4+!
How do you obtain a specific format, such as when you want integers treated as floats or vice versa?
def loadtable(fname,
              delimiter=' ',
The default delimiter must be None, meaning any whitespace, because that is the default for loadtxt and genfromtxt.
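The precedent being cited can be checked directly: loadtxt's default delimiter is None, which splits on any run of whitespace (the data here is made up):

```python
import numpy as np
from io import StringIO

# With delimiter=None (the default), loadtxt splits on any whitespace,
# so spaces and tabs both work.
a = np.loadtxt(StringIO("1 2\n3\t4\n"))
print(a.shape)   # (2, 2)
```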
Hi all, there are quite a few things to fix in the current patch (good job on the detailed comments, Bruce), but it looks like those are relatively straightforward. More important is to define what still has to be done in order for this to be merged. It would be good if we could come to a conclusion on that. It seems Pierre's take on this is that it should be able to replace genfromtxt.
Besides the above points, it looks to me like there are a number of features already that set this apart from genfromtxt: things like automatic type detection, handling datetimes, and better memory performance. So in my view it's not a given that loadtable should replace genfromtxt. Can we discuss this on this pull request? If the ML is preferred, that's fine too. But if we leave this sitting here without resolving these questions, there's a good chance that the effort spent on this will be wasted.
On 12/13/2011 03:14 PM, Ralf Gommers wrote:
Ultimately, before it is included, I think the API aspect has to be
About the genfromtxt auto-detection, it's not nearly as good. I took the first example in the loadtable docstring, and with
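The input actually used in this comment was not preserved in the thread. As an illustration of what genfromtxt's existing auto-detection (dtype=None) produces on data shaped like the loadtable docstring example, with made-up values:

```python
import numpy as np
from io import StringIO

# Illustrative data: a string label, a boolean, a float, and an int column.
s = StringIO("60to70, True, 0.3, 5\n80to90, False, 0.0, 20\n")

# dtype=None asks genfromtxt to infer a type per column;
# autostrip removes the whitespace left around each field by ', '.
a = np.genfromtxt(s, dtype=None, delimiter=',', autostrip=True,
                  encoding='utf-8')
print(a.dtype)   # one inferred type per column (f0..f3)
```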
About modularity not being a real advantage (I guess you would also include code quality in general?), I couldn't disagree more. Given our limited developer time, it's a very important benefit.
I agree with your first point, that the API should match where possible. You already identified quite a few easy changes to improve the current patch in this respect.
On 12/14/2011 12:47 PM, Ralf Gommers wrote:
On 12/14/2011 12:44 PM, Ralf Gommers wrote:
But the default delimiter in my version is a space, so I get an error. The second example is whitespace-delimited.
This gives no error, but also does not give the desired output. Really, both docstring examples are very poor examples, and the reported output appears to be incorrect.
Whether or not functions are used more than once is not relevant. Splitting them off as is done here is still far better than putting them all in a single huge function. Right now you can figure out in a few minutes how the logic in loadtable fits together.
... 'Count\n 60to70, True, 0.3, 5\n 60to70, True, 4.3,',
... '3 \n 80to90, False, 0, 20'])
>>> f = StringIO(s)
The above example doesn't work. The delimiter is ', ', except for one value where it is ','. The delimiter has to be specified to make it work for me. Delimiter auto-detection would be quite handy here.
The first example indeed doesn't work. I had just looked at the docstring and assumed that the output shown there is correct. For example two, the blank line is the one separating header from data columns, so I don't see a problem here. It's quite useful to be able to say
Unfortunate that this has stalled, but closing it makes sense. We can't leave PRs open forever. If someone wants to pick this up and process all review comments, I'm still in favor of merging it.
This is an implementation of an idea of Mark Wiebe's for how to make a function to load data from a text file into a numpy record array by automatically testing the types. It also handles missing data, placing it into masked arrays. (A future update of this can add support for the NA feature.) It also handles dates.
The need this function fills is for users who want to be able to load their data quickly from a text file when the data contains missing values and is not homogeneous. In this case a numpy record array seems like the most natural choice for the base package (though for more domain-specific uses other data structures, such as those in pandas, might be more appropriate). Unfortunately the current numpy functions for such reading of data are rather cumbersome. This function is much less cumbersome and includes sensible defaults. It is similar to matplotlib.mlab's csv2rec. This function isn't blazingly fast, and there are lots of potential avenues for improvement. However, it is perfectly usable for non-humongous datasets. For example, it loaded a fairly messy file of 2800 lines and 30 columns in about 1.3s, and the algorithm should scale linearly with the size of the input.
For the power-user, it's fairly simple to add support for additional types, such as more elaborate datetimes.
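The extensible type detection described here might look something like the following sketch. This is not the patch's actual code, just an illustration of a "ladder" of candidate parsers tried from most to least specific, where a power-user could insert an extra rung:

```python
from datetime import datetime

def _accepts_all(parser, values):
    """True if `parser` parses every value without raising."""
    try:
        for v in values:
            parser(v)
    except (ValueError, TypeError):
        return False
    return True

def detect_column_type(values):
    """Return a type name for a column of strings (hypothetical helper)."""
    ladder = [
        ('int', int),
        ('float', float),
        # A power-user could add more elaborate datetime formats here.
        ('date', lambda v: datetime.strptime(v, '%Y-%m-%d')),
    ]
    for name, parser in ladder:
        if _accepts_all(parser, values):
            return name
    return 'str'   # fallback: keep the raw text

print(detect_column_type(['1', '2']))            # 'int'
print(detect_column_type(['1.5', '2']))          # 'float'
print(detect_column_type(['2011-12-14']))        # 'date'
print(detect_column_type(['60to70', '80to90']))  # 'str'
```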
This also has a modest test suite to check functionality.