ENH: auto-type detect data loader #143


Closed · wants to merge 8 commits

Conversation

@chrisjordansquire (Contributor)

Created a function to auto-detect data types for each column in a csv-style data file. Works with arbitrary separators between the data and loads the data into a numpy record array.

This implements an idea of Mark Wiebe's for a function that loads data from a text file into a numpy record array by automatically testing the types. It also handles missing data, placing it into masked arrays (a future update can add support for the NA feature), and it handles dates.

The need this function fills is for users who want to load their data quickly from a text file when the data contains missing values and is not homogeneous. In this case a numpy record array seems like the most natural choice for the base package (though for more domain-specific uses other data structures, such as those in pandas, might be more appropriate). Unfortunately the current numpy functions for reading such data are rather cumbersome. This function is much less cumbersome and includes sensible defaults. It is similar to matplotlib.mlab's csv2rec. It isn't blazingly fast, and there are plenty of potential avenues for improvement, but it is perfectly usable for non-humongous datasets. For example, it loaded a fairly messy file of 2800 lines and 30 columns in about 1.3s, and the algorithm should scale linearly with the size of the input.

For the power-user, it's fairly simple to add support for additional types, such as more elaborate datetimes.

This also has a modest test suite to check functionality.
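
For a sense of the intended usage, here is a hypothetical call; the import path and the header/NA_re argument names are assumptions drawn from the discussion below, not the patch's confirmed signature:

    import numpy as np
    # Import path is hypothetical; the module layout is still under discussion.
    from numpy.lib.loadtable import loadtable

    # Per-column types are detected automatically; missing values become
    # masked entries in the resulting record array.
    arr = loadtable('weather.csv', delimiter=',', header=True, NA_re=r'#NA')
    print(arr.dtype)   # per-column dtypes, inferred from the file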

@pierregm (Contributor)

Mmh, would you mind stressing the differences w/ genfromtxt? Is it really worth creating a new function?

@rgommers (Member)

I had the same first reaction as Pierre. The new functionality looks useful, but would it be possible to infer the dtypes per column and then feed that to loadtxt/genfromtxt?

@chrisjordansquire (Contributor, Author)

The main advantages of this function over genfromtxt, I think, are

  • More modular (genfromtxt is a rather large, nearly 500-line, monolithic function; no function here is longer than about 80 lines, and they're fairly self-contained, which makes it easier for power users to make desired modifications)
  • delimiters can be specified via regexes (see the sketch after this list)
  • missing data can be specified via regexes
  • it's a bit simpler and has sensible defaults
  • it actually works on some (unfortunately proprietary) data that genfromtxt doesn't seem robust enough for
  • it supports datetimes
  • fairly extensible for the power user
  • makes two passes through the file, the first to determine types/sizes for strings and the second to read in the data, and pre-allocates the array for the second pass, so no giant memory bloat when reading large text files
  • fairly fast, though I think there is plenty of room for optimization
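
A minimal sketch of what the regex bullets above enable (the pattern here is illustrative, not the patch's own):

    import re

    # One regex handles ragged spacing and mixed separators in a single split.
    line = '60to70,  True ;0.3;  5'
    fields = re.split(r'\s*[,;]\s*', line.strip())
    # fields == ['60to70', 'True', '0.3', '5']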

All that said, it's entirely possible that the innards which determine the type should be ripped out and submitted as a function on their own.

I also just pushed some changes that added a bit of documentation, made a few performance enhancements, and fixed a small bug.

@rgommers (Member)

I do like this code better than genfromtxt, although I'm still somewhat worried about duplication. Did you send a message about this to the mailing list? There are a couple of people making regular improvements to loadtxt and genfromtxt who will be interested.

A comment on how the additions are organized: io is not a good module name because it's the name of a stdlib module. Scipy has the same problem; we ended up recommending importing it as import scipy.io as spio. And the Python test files with expected output could probably best be put in a single file.

@rgommers (Member)

I do like the large number of test cases, nice job on that!

@chrisjordansquire (Contributor, Author)

I'll email the general list tonight. Thanks for reminding me.

I actually think I need more test cases. Currently my tests only tell you if something breaks, without giving much indication of where or how. And it's a big enough method that that's a problem.

You mean put all the expected outputs inside the test file itself?

Placing it into io was Mark's idea. He was hoping it would be a stepping stone to later breaking up all of npyio.py into separate method files instead of one huge file. I'd be happy to use a different file name, but I think Mark has the right idea of breaking up that huge file and putting its methods in a folder.

Based on some small performance testing I did, the loadtable function's big bottlenecks are (1) checking for sizes, if that's enabled, and (2) reading in the data. The first can be disabled by the user, while the second can be replaced with a more specialized function or a C extension later on. One of the joys of explicit modularity.
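
For reference, a minimal sketch of the two-pass, pre-allocating read described earlier (names and details are assumed, not loadtable's actual internals):

    import numpy as np

    def two_pass_load(fname, dtype, delimiter=','):
        dtype = np.dtype(dtype)
        converters = [dtype[name].type for name in dtype.names]
        with open(fname) as f:
            nrows = sum(1 for _ in f)           # pass 1: count rows only
        out = np.empty(nrows, dtype=dtype)      # allocate exactly once
        with open(fname) as f:
            for i, line in enumerate(f):        # pass 2: parse into place
                values = line.strip().split(delimiter)
                out[i] = tuple(c(v) for c, v in zip(converters, values))
        return out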

I'd also welcome a name different from loadtable. I don't really like the name, but didn't know what else to call it. I was modeling parts of its expected behavior and API on R's read.table function, hence the name.

Code duplication definitely seems like a potential problem. loadtable isn't well developed enough that it could completely take the place of genfromtxt, but I think there is a potential future where it could. Until then I'm just hoping this is simpler to use and fulfills a niche that genfromtxt and its wrappers don't.

@rgommers (Member)

Or in a separate file. I just see a lot of 3-line-long .py files, which feels a bit too Matlab-like. :)

Some more modularity is good, it just shouldn't be called io.

@dhomeier (Contributor) commented Sep 2, 2011

I also find it much clearer to have the (StringIO) input and comparison data together in one test file. The mode of pre-scanning the data makes this (among all the other functionality!) a complement to #144. As I posted to the list, automatic decompression and spaces as the default delimiter would let this fit more seamlessly with the existing npyio routines.
I could also think of a few more pet options:

  • enable Fortran-style exponentials (3.1415D+00 ;-) OK, probably easy enough to contribute in that framework (see the sketch below)
  • allow record names to be passed by the user - somewhat conflicting with the auto-detection, I admit, but I don't know of a way of changing the field names once the record array has been created.
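
A sketch of the Fortran-exponential conversion from the first bullet (assumed behavior, not part of the patch):

    def fortran_float(token):
        # Fortran writes double-precision exponents with 'D', which
        # Python's float() rejects, so normalize it to 'E' first.
        return float(token.replace('D', 'E').replace('d', 'e'))

    fortran_float('3.1415D+00')   # 3.1415
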
This brings to my mind that your functionality is approaching some of the things provided by the asciitable package (http://cxc.harvard.edu/contrib/asciitable/). It can, e.g., auto-read the field names from a header line, but its performance is way below loadtable's (which actually comes very close to loadtxt with the prescan option). It might be worth checking whether the two projects could benefit from each other.

@chrisjordansquire (Contributor, Author)

I'd been told about the asciitable project and looked at it some, but it ultimately didn't seem relevant to the things I was thinking about at the time, so I didn't look into it further. (A lot of what I was worrying about was recognizing types, constructing performant regular expressions--ha!--and getting type promotion rules right.) If anyone has specific suggestions for things I should look at in it, I'd happily give it a second go.

As long as we're listing wants, two other options which I'd like this to have (in the 'and I wanna pony' sense) are:

  • read only specific columns (this could actually be added without much trouble once I have time and overcome laziness)
  • able to intelligently read a type for each column, if the user is enlightened enough to specify them above/below the header, so the types don't need to be auto-detected

@dhomeier (Contributor) commented Sep 2, 2011

It was just a vague idea for the moment - as asciitable is already relatively mature, and other packages like ATPy depend on it, it would not be easily replaced by a new module. One idea for producing some synergy in the mid-range is that maybe the BaseReader of asciitable could be replaced by a more performant engine based on your work. But perhaps that is rather up to the asciitable developers to think about.

I'd also rank your first want above rather high: it would provide an equivalent to the 'usecols' option of loadtxt/genfromtxt.

@chrisjordansquire (Contributor, Author)

A note on an earlier comment: numpy says it will be deprecated in the future, but you can change the names in a record array. If x is a record array with 2 fields:

x.dtype.names = ('new_name1', 'new_name2')
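
A quick self-contained demonstration:

    import numpy as np

    x = np.array([(1, 2.5), (3, 4.5)], dtype=[('a', int), ('b', float)])
    x.dtype.names = ('new_name1', 'new_name2')   # rename both fields
    print(x['new_name1'])   # [1 3]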

@chrisjordansquire (Contributor, Author)

I refactored the tests so they're all in one test file. I also changed the default delimiter from ',' to ' ', as suggested on the mailing list.

Chris Jordan-Squire added 6 commits September 6, 2011 09:53
Created a function to auto-detect data types for each column in a csv-style
data file. Works with arbitrary separators between the data and loads the
data into a numpy record array.
@chrisjordansquire (Contributor, Author)

I added the ability to select only certain columns.

@mwiebe (Member) commented Sep 6, 2011

Would it overcomplicate things to try and automatically detect a delimiter by default, like some spreadsheet apps do? Maybe based on some heuristic regexes on the first line of data?

@chrisjordansquire (Contributor, Author)

I'm not sure what the best heuristics would be. The most common delimiters, according to Wikipedia, are whitespace (either tab or space), commas, colons, and pipes. So perhaps just splitting on each of those and going with the one that yields the longest list would be appropriate; see the sketch below.
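
For concreteness, that heuristic could be as small as this sketch (not the patch's code):

    def guess_delimiter(line, candidates=(',', '\t', ' ', ':', '|')):
        # The candidate that splits the first line into the most fields wins.
        return max(candidates, key=lambda d: len(line.split(d)))

    guess_delimiter('1,2,3,4')   # ','
    guess_delimiter('1 2 3 4')   # ' '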

But on the mailing list some people indicated they wanted whitespace as the default delimiter. And I kinda think the user should specify if they want some wonky heuristic used to determine the delimiter.

Another route would just be writing a few delimiter specific wrappers. Like a loadcsvtable, except with a non-ugly name.

I'm open to suggestions, as I can't think of a way to do it that really feels 'right'.

@dhomeier (Contributor) commented Sep 6, 2011

It would probably conflict with comma_decimals, right? Though that option is off by default.

@chrisjordansquire (Contributor, Author)

Yes. That's part of why I'm not satisfied with that heuristic. Though it could be modified easily enough by automatically converting anything between quotes into some placeholder, such as 0. (I already do this at one point in the code.)
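
A sketch of that masking step (the patch's actual placeholder handling may differ):

    import re

    # Neutralize quoted fields so delimiters inside them don't skew the count.
    masked = re.sub(r'"[^"]*"', '0', '"1,5", 2, "3,7"')
    # masked == '0, 2, 0'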

@mwiebe (Member) commented Sep 6, 2011

In some simple cases it seems obvious:

'1, 2, 3, 4' or '1,2,3,4' vs '1 2 3 4'

It's unlikely here that the first example should produce the strings "1,", "2,", "3," and the number 4, or that the second example should produce the single string "1,2,3,4". Things get more complicated when you have quoted strings, though, where a simple Python split() would not separate at the right place anyway:

'"A, B, C", 3.1, "(0 1) \" (2 3)"'

which I would expect to produce the string 'A, B, C', the number 3.1, and the string '(0 1) " (2 3)'.

One approach for the heuristic would be to define an item regex (a quoted string, which may include the delimiter, or a run of characters excluding the delimiter), then match the whole line against item (delimiter item)* to see how well the delimiter works. Probably more details would need to be worked out, though.
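
A rough sketch of that item-regex idea, under the assumption of a single-character, non-whitespace delimiter (the details left open above remain open here too):

    import re

    def count_fields(line, delim):
        # An item is either a double-quoted string (which may contain the
        # delimiter and backslash-escaped quotes) or a run of characters
        # containing neither the delimiter nor a quote.
        item = re.compile(r'\s*(?:"(?:[^"\\]|\\.)*"|[^%s"]+)\s*'
                          % re.escape(delim))
        pos, fields = 0, 0
        while pos < len(line):
            m = item.match(line, pos)
            if m is None:
                return None        # line doesn't fit this delimiter's grammar
            fields += 1
            pos = m.end()
            if pos < len(line):
                if line[pos] != delim:
                    return None
                pos += 1           # step over the delimiter
        return fields

    count_fields('"A, B, C", 3.1, "(0 1) \\" (2 3)"', ',')   # 3

A delimiter that yields a field count (rather than None) on every sampled line, with the highest such count, would be the heuristic's pick.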

@chrisjordansquire (Contributor, Author)

I stuck np.lib._datasource in, just as in genfromtxt. It appears to be working, but I don't have a good idea how to test that. I'd appreciate suggestions.

@charris (Member) commented Sep 9, 2011

If you would like to look through _datasource with an eye to simplifying it, I wouldn't complain ;) It seems a bit strange that a module that advertises itself as a generic file opener for the researcher looks like a private module.

@chrisjordansquire (Contributor, Author)

Yes, I was confused by the _datasource code as well. In the end, I just used the same call that genfromtxt does and verified that it passed all my tests.

@charris (Member) commented Oct 8, 2011

What is the status of this request? Are folks happy with the functionality and the addition of a new function?


Review comment on the diff:

    Parameters
    ----------
    f: file

Is this a 'file' or a 'file object'?

@bsouthey

The code must work with Python 2.4+! For example, loadtable.py contains syntax that is invalid in Python 2.4 (conditional expressions were only added in Python 2.5):

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "loadtable.py", line 953
    coltypes_input = [dtype_dict[t] if t not in
                                 ^
SyntaxError: invalid syntax
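
For reference, a generic illustration of the 2.4-compatible rewrite (not the actual line from loadtable.py, which is truncated above):

    cond, a, b = True, 1, 2

    # Python 2.5+ only:
    value = a if cond else b

    # Python 2.4-compatible equivalent:
    if cond:
        value = a
    else:
        value = b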

@bsouthey

How do you obtain a specific format, such as when you want integers treated as floats or vice versa? In this example, the first column becomes a float, so it requires extra processing to convert it to an integer type:

a,b,c,d
1,2,3,4
1.,3,4,5
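
Lacking a per-column dtype override, one workaround is to rebuild the record array with the desired field type after loading. A sketch, where arr stands in for loadtable's hypothetical output on the data above:

    import numpy as np

    # What auto-detection would give for the data above: column 'a' is float.
    arr = np.array([(1.0, 2, 3, 4), (1.0, 3, 4, 5)],
                   dtype=[('a', float), ('b', int), ('c', int), ('d', int)])

    # Rebuild with 'a' forced to integer; values are cast field by field.
    new_dtype = [('a', int)] + [(n, arr.dtype[n]) for n in arr.dtype.names[1:]]
    fixed = np.empty(arr.shape, dtype=new_dtype)
    for name in arr.dtype.names:
        fixed[name] = arr[name]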



Review comment on the diff:

    def loadtable(fname,
                  delimiter=' ',

The default delimiter must be None, which means whitespace, because that is the default for loadtxt and genfromtxt.

@rgommers (Member)

Hi all, there are quite a few things to fix in the current patch (good job on the detailed comments, Bruce), but it looks like those are relatively straightforward. More important is to define what still has to be done in order for this to be merged. It would be good if we could come to a conclusion on that. It seems Pierre's take is that it should be able to replace genfromtxt. What's not completely clear to me is:

  • what exactly has to be done to cover all functionality
  • can loadtable be comparable speed-wise? or is this already the case?
  • if too many corner cases are not yet handled, is it possible to define an intermediate goal that allows us to merge this? either way, genfromtxt will not be deprecated for a long time (if ever), I think

Besides the above points, it looks to me like there are already a number of features that set this apart from genfromtxt: things like automatic type detection, handling datetimes, and better memory performance. So in my view it's not a given that loadtable should replace genfromtxt.

Can we discuss this on this pull request? If the ML is preferred, that's fine too. But if we leave this sitting here without resolving these questions, there's a good chance that the effort spent on this will be wasted...

@bsouthey

On 12/13/2011 03:14 PM, Ralf Gommers wrote:

> Hi all, there are quite a few things to fix in the current patch (good job on the detailed comments, Bruce), but it looks like those are relatively straightforward.
I am trying to be objective, because my view is that this has to offer something that is needed by the community as a whole. Currently my comments are just the easy and obvious things that come from reading the code without understanding what the code can do. The two issues that become apparent are:

  1. The API should match, where possible, the genfromtxt or loadtxt API, because that will enable easier user transition across functions. So I think the lack of user-specified 'column' names and the lack of user-defined dtypes (especially if you want to use, say, int8 or float128 instead of the 'platform default') are the obvious major roadblocks.

  2. My really big hangup is that you need to be a regular expression guru to use it:

  • The simplest case is that the missing-value argument needs to be formulated as a regular expression, so NA_re='.' does not work as expected (both in terms of the argument and the outcome). That is a big issue for me, because '.' is a valid missing-value code in SAS and Stata, and '*' is another 'common' missing value (GenStat). Having to remember to always escape those is a big problem (see the sketch after this list).
  • I do not think that the delimiter should be a regular expression, because the user should know exactly what it is.
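
To illustrate the escaping pitfall from the first bullet (a sketch, not loadtable's actual behavior):

    import re

    na_literal = '.'                             # missing-value code in SAS/Stata
    naive = re.compile(na_literal)               # '.' matches ANY character
    escaped = re.compile(re.escape(na_literal))  # matches only a literal '.'
    bool(naive.fullmatch('5')), bool(escaped.fullmatch('5'))   # (True, False)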

> More important is to define what still has to be done in order for this to be merged. It would be good if we could come to a conclusion on that. It seems Pierre's take is that it should be able to replace genfromtxt. What's not completely clear to me is:
>
>   • what exactly has to be done to cover all functionality
>   • can loadtable be comparable speed-wise? or is this already the case?
>   • if too many corner cases are not yet handled, is it possible to define an intermediate goal that allows us to merge this? either way, genfromtxt will not be deprecated for a long time (if ever), I think
>
> Besides the above points, it looks to me like there are already a number of features that set this apart from genfromtxt: things like automatic type detection, handling datetimes, and better memory performance. So in my view it's not a given that loadtable should replace genfromtxt.
I thought that genfromtxt does automatic type detection and can handle dates/times, especially in any user-defined format. I do not know that it has better memory performance, as that is not easy to measure under Python, so code to check that would be great.

> Can we discuss this on this pull request? If the ML is preferred, that's fine too. But if we leave this sitting here without resolving these questions, there's a good chance that the effort spent on this will be wasted...


Really, I do find the approach intriguing, but to replace loadtxt or genfromtxt it needs to be shown to be 'significantly' faster while still handling the data in the same manner as those functions. I have been avoiding checking that, as it is rather time consuming. The other 'advantages' given are not real (such as modularity) or are things I find to be disadvantages (like two passes through the data).

Ultimately, before it is included, I think the API aspect has to be discussed on the mailing list, as per Chris Barker's comment (http://mail.scipy.org/pipermail/numpy-discussion/2011-December/059557.html) about the genfromtxt API. Actually, the real reason is that APIs are sort of 'set in stone' and become very hard to change after inclusion.

@rgommers (Member)

About the genfromtxt auto-detection, it's not nearly as good. I took the first example in the loadtable docstring, and with genfromtxt I need:

np.genfromtxt(f, delimiter=',', skiprows=3, dtype=None)

@rgommers (Member)

About modularity not being a real advantage (I guess you would also include code quality in general?), I couldn't disagree more. Given our limited developer time, it's a very important benefit.

@rgommers (Member)

I agree with your first point, that the API should match where possible. You already identified quite a few easy changes to improve the current patch in this respect.

@bsouthey

On 12/14/2011 12:47 PM, Ralf Gommers wrote:

> About modularity not being a real advantage (I guess you would also include code quality in general?), I couldn't disagree more. Given our limited developer time, it's a very important benefit.
This is about the modularity of this specific pull request. From what I could tell, many 'functions' are used only once (init_type_search_order) or are even dubiously defined, such as 'init_re_dict', which is just an if statement hiding in a class. Hence I do not consider the 'modularity' of those functions an advantage; rather, it is a disadvantage, as Python needs extra resources to handle those classes, even if those are relatively minute.

@bsouthey

On 12/14/2011 12:44 PM, Ralf Gommers wrote:

> About the genfromtxt auto-detection, it's not nearly as good. I took the first example in the loadtable docstring, and with genfromtxt I need:
>
>     np.genfromtxt(f, delimiter=',', skiprows=3, dtype=None)
The first example is delimited by both a comma and a space:

    >>> s = ''.join(['#Comment \n \n TempRange, Cloudy, AvgInchesRain,',
    ... 'Count\n 60to70, True, 0.3, 5\n 60to70, True, 4.3,',
    ... '3 \n 80to90, False, 0, 20'])

But the default delimiter in my version is a space, so I get an error unless I provide delimiter=','; and when I do set delimiter=',', I get the wrong output.

The second example is whitespace-delimited:

    >>> s = ''.join(['//Comment \n \n TempRange Cloudy AvgInchesRain',
    ... ' Count\n #NA True "0,3" 5\n "60to70" True "4,3" 3 \n',
    ... '"80to90" False #NA 20'])

This gives no error, but it also does not give the desired output with loadtable.

Really, both docstring examples are very poor examples, and the reported output is wrong. The main error is that the second 'empty' row cannot be skipped: there is a space on that line, so the first column is either a space or empty/missing (depending on your delimiter) and the rest of the line is empty. This is essentially Ticket 1071 (http://projects.scipy.org/numpy/ticket/1071), so of course you have to tell genfromtxt to skip that line.

@rgommers (Member)

Whether or not functions are used more than once is not relevant. Splitting them off as is done here is still far better than putting them all in one huge function. The init_xx functions are on average about 10 lines long (plus docs/comments); putting them in the loadtable body would triple its length and make it far less readable.

Right now you can figure out in a few minutes how the logic in loadtable works. This is valuable. For genfromtxt I still wouldn't be able to tell you exactly.

Review comment on the diff:

    ... 'Count\n 60to70, True, 0.3, 5\n 60to70, True, 4.3,',
    ... '3 \n 80to90, False, 0, 20'])
    >>> f = StringIO(s)

The above example doesn't work. The delimiter is ', ' except for one value, where it is ','. The delimiter has to be specified to make it work for me. Delimiter auto-detection would be quite handy here.

@rgommers (Member)

The first example indeed doesn't work. I had just looked at the docstring and assumed that the output shown there is correct.

For example two, the blank line is the one separating the header from the data columns, so I don't see a problem there. It's quite useful to be able to say header=True instead of having to count the number of lines to skip by hand.

@charris (Member) commented May 11, 2013

@rgommers @bsouthey I'm going to close this unless there is resistance. It sounds like the functionality might be good, but the development has stalled. Does someone else want to pick this up?

@rgommers (Member)

Unfortunate that this has stalled, but closing it makes sense. We can't leave PRs open forever. If someone wants to pick this up and process all review comments, I'm still in favor of merging it.

@rgommers closed this May 19, 2013