[MRG+2]: Svmlight chunk loader #935
Conversation
The issue probably comes from the fact that the serialization uses long integer values. We should check that libsvm can accept such long values in its input format, though.
I've yet to read this through, but wouldn't it be easier to add a single parameter indicating the number of lines to read, then pass the same file-like object to the loader repeatedly? (Also, how large are your matrices? I was under the impression that you can never store more than 2**31 non-zero values in a single scipy CSR matrix anyway.)
I would like to be able to deal with matrices of the scale of the PASCAL large scale challenge:
This indeed won't fit on a single CSR:
That's unfortunate, but I guess I can split it into 10 CSR chunks or so and use them one by one. Would be great to have support for that out of the box.
Splitting lines is CPU bound (at least ...).
Actually, ...
Interesting / useful contrib! It could be useful to have a way to estimate the actual size once a file chunk has been converted to CSR format. Minor remark: for me, it would be more natural to use an offset and a length.
Then ...
Ok, I rebased, added more tests and fixed a bug. I also switched to @mblondel's suggested API (offset + length). I think this is ready for merge to master. WDYT?
 def load_svmlight_files(files, n_features=None, dtype=np.float64,
-                        multilabel=False, zero_based="auto", query_id=False):
+                        multilabel=False, zero_based="auto", query_id=False,
+                        offset=0, length=-1):
I can understand that these options are useful for load_svmlight_file but are they for load_svmlight_files?
The problem is that load_svmlight_file is implemented by calling the load_svmlight_files function. Maybe I should just not document the parameters in the load_svmlight_files function.
Just curious, does it actually make sense to have the same offset and length when calling this with multiple files?
I think it would be useful to add a generator function, say load_svmlight_file_chunks, that takes an n_chunks parameter and produces (X, y) pairs.
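For illustration, a minimal sketch of what such a generator could look like on top of the offset/length parameters from this PR. The name load_svmlight_file_chunks is only the one suggested above, not an existing scikit-learn function, and it assumes n_features is known up front, zero-based indices, and that no single line is longer than one chunk:

import os
from sklearn.datasets import load_svmlight_file

def load_svmlight_file_chunks(path, n_chunks, n_features):
    # Hypothetical helper: split the file into n_chunks contiguous byte
    # ranges and parse each range independently with offset/length.
    size = os.path.getsize(path)
    chunk_size = size // n_chunks
    for i in range(n_chunks):
        offset = i * chunk_size
        # The last chunk reads until the end of the file (length=-1).
        length = chunk_size if i < n_chunks - 1 else -1
        # zero_based is fixed explicitly so that the "auto" heuristic
        # cannot differ from one chunk to the next.
        X, y = load_svmlight_file(path, n_features=n_features,
                                  zero_based=True,
                                  offset=offset, length=length)
        yield X, y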
You assume that ...
I was thinking of writing a parallel, out-of-core conversion tool that would instead generate joblib dumps of chunked data in a folder. Do you think both would be useful?
For my conversion tool I want to do a single parsing pass over the data and record the number of features on the fly.
Inferring n_features seems a bit expensive. We could reproject the data while it is loaded from the svmlight file using a FeatureHasher. This way, n_features can be safely fixed. Another thing I would like to check is whether a crude upper bound on n_features would work. The training time of solvers like SGDClassifier or LinearSVC with dual=True is not affected by the number of features (the training time of CD-based solvers is).
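A rough sketch of that reprojection idea, hashing hand-parsed (feature_id, value) string pairs with FeatureHasher so the output width is fixed up front. The helper name and default n_features are made up, and comment lines, qid fields and multilabel targets are ignored for brevity:

import numpy as np
from sklearn.feature_extraction import FeatureHasher

def hash_svmlight_lines(lines, n_features=2 ** 20):
    # Hypothetical helper: hash raw "label idx:value ..." lines directly
    # into a CSR matrix of fixed width, so n_features never has to be
    # inferred from the data.
    labels, samples = [], []
    for line in lines:
        tokens = line.split()
        labels.append(float(tokens[0]))
        # Keep the feature ids as strings: FeatureHasher with
        # input_type="pair" hashes (name, value) pairs.
        samples.append([(name, float(value))
                        for name, value in (t.split(":") for t in tokens[1:])])
    hasher = FeatureHasher(n_features=n_features, input_type="pair")
    return hasher.transform(samples), np.array(labels)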
My goal is to convert the svmlight file into a contiguous data structure saved on the hard drive once, so that I never have to parse the svmlight file again. It's a dataset format conversion tool. I don't want to do learning on the fly in my case.
I simply use joblib for these purposes :)
Yes, but in this case I need to do it out of core, as the svmlight file is 11GB and the dense representation is twice as big (without compression). I also want to detect the number of features on the fly, so I need a second non-parsing pass to pad the previously extracted arrays with zero features.
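As a side note, that second pass can be made essentially free: a CSR chunk can be widened to the final number of features by rebuilding it with a larger declared shape, reusing its existing arrays. A small sketch (the helper name is hypothetical):

import scipy.sparse as sp

def pad_csr_chunks(chunks, n_features):
    # Widen each CSR chunk to the final n_features without touching the
    # stored values: only the declared shape changes, the extra columns
    # are implicit zeros.
    return [sp.csr_matrix((X.data, X.indices, X.indptr),
                          shape=(X.shape[0], n_features))
            for X in chunks]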
The svmlight format is very useful for preparing the data in one programming language and learning the model in another. It's really easy to write a script for outputting data to this format.
I think the chunk loading support can be merged as it is. It's already useful for advanced users. I am not sure we want to implement a generic out-of-core converter in the library. Maybe it would be better implemented as an example in a benchmark script based on the mnist8m dataset. I will do that in another PR later.
Could you rebase?
LGTM
I think you should explicitly test boundary cases:
f.seek(offset) such that f.read(1) == '\n'
f.seek(length) such that f.read(1) == '\n'
f.seek(length - 1) such that f.read(1) == '\n'
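For concreteness, a sketch of an exhaustive check covering these boundaries: split a small file at every possible byte position and verify that the two chunks stitched together match a single full load. The data_path argument and the helper name are placeholders, not the actual test added in this PR:

import os
import numpy as np
import scipy.sparse as sp
from sklearn.datasets import load_svmlight_file

def check_all_split_points(data_path, n_features):
    # Reference: one load of the whole file.
    X_full, y_full = load_svmlight_file(data_path, n_features=n_features,
                                        zero_based=True)
    size = os.path.getsize(data_path)
    for split in range(1, size):
        # First chunk covers bytes [0, split), second chunk the rest; the
        # line straddling the split must end up in exactly one chunk.
        X0, y0 = load_svmlight_file(data_path, n_features=n_features,
                                    zero_based=True, offset=0, length=split)
        X1, y1 = load_svmlight_file(data_path, n_features=n_features,
                                    zero_based=True, offset=split, length=-1)
        X = sp.vstack([X0, X1])
        y = np.concatenate([y0, y1])
        assert np.array_equal(X.toarray(), X_full.toarray())
        assert np.array_equal(y, y_full)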
sklearn/datasets/svmlight_format.py (Outdated)

    discarding the following bytes up until the next new line
    character.

    length: integer, optional, default -1
space before colon
sklearn/datasets/svmlight_format.py (Outdated)

    discarding the following bytes up until the next new line
    character.

    length: integer, optional, default -1
space before colon
@jnothman thanks for the review. I added a test that checks all possible byte offsets on a small dataset (along with query ids). The exhaustive test runs in 500ms. This should cover all the boundary cases you mentioned.
@jnothman I fixed (worked around) the broken tests with old versions of scipy.
Mind adding a what's new entry before merge, @ogrisel?
@jnothman done. I also rebased on top of current master to insert the entry at the right location. Let's see if CI is still green.
Merged! The scipy version check in the test was too lax. I updated it.
I'm tempted to say: Thanks for your patience, @ogrisel ;)
It's pretty exciting to close an issue numbered below #1000.
Hi all,
I am working on an incremental data loader for the svmlight format that reads chunks of a big file (not expected to fit in memory as a whole) into smaller CSR matrices, to be dumped as a set of memmappable files in a folder and later re-concatenated into a single, large, memmapped CSR matrix.
The goal is to be able to load big svmlight files (multiple tens of GB) into an efficient memmapped CSR matrix in an out-of-core manner (possibly using several workers in parallel).
The first step is to augment the existing parser to be able to load chunks of a svmlight file using seeks to byte offsets.
Edit: the scope of this PR has changed. It is now just about loading a chunk (given by byte offset and length) of a large svmlight file as a CSR matrix that fits in RAM. This would make it possible to efficiently load and parse a large svmlight file with workers on PySpark or dask distributed, for instance.
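To make that workflow concrete, here is a rough sketch using joblib workers instead of PySpark or dask. Everything beyond the offset/length parameters added by this PR (the helper names, the even byte partitioning, the fixed n_features and zero_based) is an illustrative assumption, and it presumes no single line is longer than one chunk:

import os
from joblib import Parallel, delayed
from sklearn.datasets import load_svmlight_file

def load_chunk(path, n_features, offset, length):
    # Each worker parses only its own byte range of the shared file.
    return load_svmlight_file(path, n_features=n_features, zero_based=True,
                              offset=offset, length=length)

def parallel_load(path, n_features, n_workers=4):
    size = os.path.getsize(path)
    chunk = size // n_workers
    bounds = [(i * chunk, chunk if i < n_workers - 1 else -1)
              for i in range(n_workers)]
    # Returns a list of (X, y) CSR chunks, one per worker.
    return Parallel(n_jobs=n_workers)(
        delayed(load_chunk)(path, n_features, off, length)
        for off, length in bounds)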