[MRG+2]: Svmlight chunk loader #935
Conversation
The issue probably comes from the fact that the serialization uses long integer values. We should check that libsvm can accept such long values in its input format, though.
I've yet to read this through, but wouldn't it be easier to add a single parameter indicating the number of lines to read, then pass the same file-like object to the loader repeatedly? (Also, how large are your matrices? I was under the impression that you can never store more than 2**31 non-zero values in a single scipy CSR matrix anyway.)
I would like to be able to deal with matrices of the scale of the PASCAL large scale challenge:
This indeed won't fit on a single CSR:
That's unfortunate, but I guess I can split it into 10 CSR chunks or so and use them one by one. Would be great to have support for that out of the box.
Splitting lines is CPU bound (at least ...).
Actually, ...
Interesting / useful contrib! It could be useful to have a way to estimate the actual size once a file chunk has been converted to CSR format. Minor remark: for me, it would be more natural to use an offset and a length.
Then ...
Ok, I rebased, added more tests and fixed a bug. I also switched to @mblondel's suggested API (offset + length). I think this is ready for merge to master. WDYT?
 def load_svmlight_files(files, n_features=None, dtype=np.float64,
-                        multilabel=False, zero_based="auto", query_id=False):
+                        multilabel=False, zero_based="auto", query_id=False,
+                        offset=0, length=-1):
I can understand that these options are useful for load_svmlight_file but are they for load_svmlight_files?
The problem is that load_svmlight_file is implemented by calling the load_svmlight_files function. Maybe I should just not document the parameters in the load_svmlight_files function.
Just curious, does it actually make sense to have the same offset and length when calling this with multiple files?
I think it would be useful to add a generator function, say load_svmlight_file_chunks, that takes an n_chunks parameter and produces (X, y) pairs.
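For illustration, a minimal sketch of what such a generator could look like on top of the offset/length parameters from this PR. The name load_svmlight_file_chunks is only the one suggested above, not an existing scikit-learn function, and it assumes n_features is known up front, zero-based indices, and that no single line is longer than one chunk:

import os
from sklearn.datasets import load_svmlight_file

def load_svmlight_file_chunks(path, n_chunks, n_features):
    # Hypothetical helper: split the file into n_chunks contiguous byte
    # ranges and parse each range independently with offset/length.
    size = os.path.getsize(path)
    chunk_size = size // n_chunks
    for i in range(n_chunks):
        offset = i * chunk_size
        # The last chunk reads until the end of the file (length=-1).
        length = chunk_size if i < n_chunks - 1 else -1
        # zero_based is fixed explicitly so that the "auto" heuristic
        # cannot differ from one chunk to the next.
        X, y = load_svmlight_file(path, n_features=n_features,
                                  zero_based=True,
                                  offset=offset, length=length)
        yield X, y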
You assume that ...
I was thinking of writing a parallel, out-of-core conversion tool that would instead generate joblib dumps of chunked data in a folder. Do you think both would be useful?
For my conversion tool I want to do a single parsing pass over the data and record the number of features on the fly.
Inferring n_features seems a bit expensive. We could reproject the data while it is loaded from the svmlight file using a FeatureHasher. This way, n_features can be safely fixed. Another thing I would like to check is whether a crude upper bound on n_features would work. The training time of solvers like SGDClassifier or LinearSVC with dual=True is not affected by the number of features (the training time of CD-based solvers is).
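A rough sketch of that reprojection idea, hashing hand-parsed (feature_id, value) string pairs with FeatureHasher so the output width is fixed up front. The helper name and default n_features are made up, and comment lines, qid fields and multilabel targets are ignored for brevity:

import numpy as np
from sklearn.feature_extraction import FeatureHasher

def hash_svmlight_lines(lines, n_features=2 ** 20):
    # Hypothetical helper: hash raw "label idx:value ..." lines directly
    # into a CSR matrix of fixed width, so n_features never has to be
    # inferred from the data.
    labels, samples = [], []
    for line in lines:
        tokens = line.split()
        labels.append(float(tokens[0]))
        # Keep the feature ids as strings: FeatureHasher with
        # input_type="pair" hashes (name, value) pairs.
        samples.append([(name, float(value))
                        for name, value in (t.split(":") for t in tokens[1:])])
    hasher = FeatureHasher(n_features=n_features, input_type="pair")
    return hasher.transform(samples), np.array(labels)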
My goal is to convert the svmlight file into a contiguous data structure saved on the hard drive once, so that I never have to parse the svmlight file again. It's a dataset format conversion tool. I don't want to do learning on the fly in my case.
I simply use joblib for these purposes :)
Yes, but in this case I need to do it out of core, as the svmlight file is 11GB and the dense representation is twice as big (without compression). I also want to detect the number of features on the fly, so I need a second non-parsing pass to pad the previously extracted arrays with zero features.
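As a side note, that second pass can be made essentially free: a CSR chunk can be widened to the final number of features by rebuilding it with a larger declared shape, reusing its existing arrays. A small sketch (the helper name is hypothetical):

import scipy.sparse as sp

def pad_csr_chunks(chunks, n_features):
    # Widen each CSR chunk to the final n_features without touching the
    # stored values: only the declared shape changes, the extra columns
    # are implicit zeros.
    return [sp.csr_matrix((X.data, X.indices, X.indptr),
                          shape=(X.shape[0], n_features))
            for X in chunks]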
The svmlight format is very useful for preparing the data in one programming language and learning the model in another. It's really easy to write a script for outputting data to this format.
I think the chunk loading support can be merged as it is. It's already useful for advanced users. I am not sure we want to implement a generic out-of-core converter in the library. Maybe it would be better implemented as an example in a benchmark script based on the mnist8m dataset. I will do that in another PR later.
Could you rebase?
LGTM
I think you should explicitly test boundary cases:
f.seek(offset) such that f.read(1) == '\n'
f.seek(length) such that f.read(1) == '\n'
f.seek(length - 1) such that f.read(1) == '\n'
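For concreteness, a sketch of an exhaustive check covering these boundaries: split a small file at every possible byte position and verify that the two chunks stitched together match a single full load. The data_path argument and the helper name are placeholders, not the actual test added in this PR:

import os
import numpy as np
import scipy.sparse as sp
from sklearn.datasets import load_svmlight_file

def check_all_split_points(data_path, n_features):
    # Reference: one load of the whole file.
    X_full, y_full = load_svmlight_file(data_path, n_features=n_features,
                                        zero_based=True)
    size = os.path.getsize(data_path)
    for split in range(1, size):
        # First chunk covers bytes [0, split), second chunk the rest; the
        # line straddling the split must end up in exactly one chunk.
        X0, y0 = load_svmlight_file(data_path, n_features=n_features,
                                    zero_based=True, offset=0, length=split)
        X1, y1 = load_svmlight_file(data_path, n_features=n_features,
                                    zero_based=True, offset=split, length=-1)
        X = sp.vstack([X0, X1])
        y = np.concatenate([y0, y1])
        assert np.array_equal(X.toarray(), X_full.toarray())
        assert np.array_equal(y, y_full)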
sklearn/datasets/svmlight_format.py (Outdated)

    discarding the following bytes up until the next new line
    character.

    length: integer, optional, default -1
space before colon
sklearn/datasets/svmlight_format.py (Outdated)

    discarding the following bytes up until the next new line
    character.

    length: integer, optional, default -1
space before colon
@jnothman thanks for the review. I added a test that checks all possible byte offsets on a small dataset (along with query ids). The exhaustive test runs in 500ms. This should cover all the boundary cases you mentioned.
@jnothman I fixed (worked around) the broken tests with old versions of scipy.
Mind adding a what's new entry before merge, @ogrisel?
@jnothman done. I also rebased on top of current master to insert the entry at the right location. Let's see if CI is still green.
Merged! The scipy version check in the test was too lax. I updated it.
I'm tempted to say: Thanks for your patience, @ogrisel ;)
It's pretty exciting to close an issue numbered below #1000.
Hi all,
I am working on an incremental data loader for the svmlight format that reads chunks of a big file (not expected to fit in memory as a whole) into smaller CSR matrices, to be dumped as a set of memmappable files in a folder and later re-concatenated into a single, large, memmapped CSR matrix.
The goal is to be able to load big svmlight files (multiple tens of GB) into an efficient memmapped CSR matrix in an out-of-core manner (possibly using several workers in parallel).
The first step is to augment the existing parser to be able to load chunks of a svmlight file using seeks to byte offsets.
Edit: the scope of this PR has changed. It is now just about loading a chunk (given by byte offset and length) of a large svmlight file as a CSR matrix that fits in RAM. This would make it possible to efficiently load and parse a large svmlight file with workers on PySpark or dask distributed, for instance.
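To make that workflow concrete, here is a rough sketch using joblib workers instead of PySpark or dask. Everything beyond the offset/length parameters added by this PR (the helper names, the even byte partitioning, the fixed n_features and zero_based) is an illustrative assumption, and it presumes no single line is longer than one chunk:

import os
from joblib import Parallel, delayed
from sklearn.datasets import load_svmlight_file

def load_chunk(path, n_features, offset, length):
    # Each worker parses only its own byte range of the shared file.
    return load_svmlight_file(path, n_features=n_features, zero_based=True,
                              offset=offset, length=length)

def parallel_load(path, n_features, n_workers=4):
    size = os.path.getsize(path)
    chunk = size // n_workers
    bounds = [(i * chunk, chunk if i < n_workers - 1 else -1)
              for i in range(n_workers)]
    # Returns a list of (X, y) CSR chunks, one per worker.
    return Parallel(n_jobs=n_workers)(
        delayed(load_chunk)(path, n_features, off, length)
        for off, length in bounds)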