[MRG+2]: Svmlight chunk loader #935
Merged: ogrisel merged 2 commits into scikit-learn:master from ogrisel:svmlight-memmaped-loader on Jun 16, 2017.
```diff
@@ -31,7 +31,8 @@
 
 
 def load_svmlight_file(f, n_features=None, dtype=np.float64,
-                       multilabel=False, zero_based="auto", query_id=False):
+                       multilabel=False, zero_based="auto", query_id=False,
+                       offset=0, length=-1):
     """Load datasets in the svmlight / libsvm format into sparse CSR matrix
 
     This format is a text-based format, with one sample per line. It does
@@ -76,6 +77,8 @@ def load_svmlight_file(f, n_features=None, dtype=np.float64,
         bigger sliced dataset: each subset might not have examples of
         every feature, hence the inferred shape might vary from one
         slice to another.
+        n_features is only required if ``offset`` or ``length`` are passed a
+        non-default value.
 
     multilabel : boolean, optional, default False
         Samples may have several labels each (see
@@ -88,7 +91,10 @@ def load_svmlight_file(f, n_features=None, dtype=np.float64,
         If set to "auto", a heuristic check is applied to determine this from
         the file contents. Both kinds of files occur "in the wild", but they
         are unfortunately not self-identifying. Using "auto" or True should
-        always be safe.
+        always be safe when no ``offset`` or ``length`` is passed.
+        If ``offset`` or ``length`` are passed, the "auto" mode falls back
+        to ``zero_based=True`` to avoid having the heuristic check yield
+        inconsistent results on different segments of the file.
 
     query_id : boolean, default False
         If True, will return the query_id array for each file.
@@ -97,6 +103,15 @@ def load_svmlight_file(f, n_features=None, dtype=np.float64,
         Data type of dataset to be loaded. This will be the data type of the
         output numpy arrays ``X`` and ``y``.
 
+    offset : integer, optional, default 0
+        Ignore the offset first bytes by seeking forward, then
+        discarding the following bytes up until the next new line
+        character.
+
+    length : integer, optional, default -1
+        If strictly positive, stop reading any new line of data once the
+        position in the file has reached the (offset + length) bytes threshold.
+
     Returns
     -------
     X : scipy.sparse matrix of shape (n_samples, n_features)
@@ -129,7 +144,7 @@ def get_data():
         X, y = get_data()
     """
     return tuple(load_svmlight_files([f], n_features, dtype, multilabel,
-                                     zero_based, query_id))
+                                     zero_based, query_id, offset, length))
 
 
 def _gen_open(f):
@@ -149,15 +164,18 @@ def _gen_open(f):
     return open(f, "rb")
 
 
-def _open_and_load(f, dtype, multilabel, zero_based, query_id):
+def _open_and_load(f, dtype, multilabel, zero_based, query_id,
+                   offset=0, length=-1):
     if hasattr(f, "read"):
         actual_dtype, data, ind, indptr, labels, query = \
-            _load_svmlight_file(f, dtype, multilabel, zero_based, query_id)
+            _load_svmlight_file(f, dtype, multilabel, zero_based, query_id,
+                                offset, length)
     # XXX remove closing when Python 2.7+/3.1+ required
     else:
         with closing(_gen_open(f)) as f:
             actual_dtype, data, ind, indptr, labels, query = \
-                _load_svmlight_file(f, dtype, multilabel, zero_based, query_id)
+                _load_svmlight_file(f, dtype, multilabel, zero_based,
+                                    query_id, offset, length)
 
     # convert from array.array, give data the right dtype
     if not multilabel:
@@ -172,7 +190,8 @@ def _open_and_load(f, dtype, multilabel, zero_based, query_id):
 
 
 def load_svmlight_files(files, n_features=None, dtype=np.float64,
-                        multilabel=False, zero_based="auto", query_id=False):
+                        multilabel=False, zero_based="auto", query_id=False,
+                        offset=0, length=-1):
     """Load dataset from multiple files in SVMlight format
 
     This function is equivalent to mapping load_svmlight_file over a list of
@@ -216,7 +235,10 @@ def load_svmlight_files(files, n_features=None, dtype=np.float64,
         If set to "auto", a heuristic check is applied to determine this from
         the file contents. Both kinds of files occur "in the wild", but they
         are unfortunately not self-identifying. Using "auto" or True should
-        always be safe.
+        always be safe when no offset or length is passed.
+        If offset or length are passed, the "auto" mode falls back
+        to zero_based=True to avoid having the heuristic check yield
+        inconsistent results on different segments of the file.
 
     query_id : boolean, defaults to False
         If True, will return the query_id array for each file.
@@ -225,6 +247,15 @@ def load_svmlight_files(files, n_features=None, dtype=np.float64,
         Data type of dataset to be loaded. This will be the data type of the
         output numpy arrays ``X`` and ``y``.
 
+    offset : integer, optional, default 0
+        Ignore the offset first bytes by seeking forward, then
+        discarding the following bytes up until the next new line
+        character.
+
+    length : integer, optional, default -1
+        If strictly positive, stop reading any new line of data once the
+        position in the file has reached the (offset + length) bytes threshold.
+
     Returns
     -------
     [X1, y1, ..., Xn, yn]
@@ -245,16 +276,27 @@ def load_svmlight_files(files, n_features=None, dtype=np.float64,
     --------
     load_svmlight_file
     """
-    r = [_open_and_load(f, dtype, multilabel, bool(zero_based), bool(query_id))
+    if (offset != 0 or length > 0) and zero_based == "auto":
+        # disable heuristic search to avoid getting inconsistent results on
+        # different segments of the file
+        zero_based = True
+
+    if (offset != 0 or length > 0) and n_features is None:
+        raise ValueError(
+            "n_features is required when offset or length is specified.")
+
+    r = [_open_and_load(f, dtype, multilabel, bool(zero_based), bool(query_id),
+                        offset=offset, length=length)
         for f in files]
 
-    if (zero_based is False
-            or zero_based == "auto" and all(np.min(tmp[1]) > 0 for tmp in r)):
-        for ind in r:
-            indices = ind[1]
+    if (zero_based is False or
+            zero_based == "auto" and all(len(tmp[1]) and np.min(tmp[1]) > 0
+                                         for tmp in r)):
+        for _, indices, _, _, _ in r:
             indices -= 1
 
-    n_f = max(ind[1].max() for ind in r) + 1
+    n_f = max(ind[1].max() if len(ind[1]) else 0 for ind in r) + 1
 
     if n_features is None:
         n_features = n_f
     elif n_features < n_f:
```

Review comment on the `indices -= 1` line:

> This works because `indices` is an array, I was really confused for a moment ...
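The byte-range parameters introduced by this PR make it possible to read disjoint chunks of one svmlight file and reassemble them. A minimal sketch of that usage, assuming a scikit-learn release that includes this change (the file contents and sizes here are made up for illustration; `n_features` must be passed explicitly because a chunk may not contain every feature):

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Build a small dataset and dump it to disk in svmlight format.
X = sp.random(20, 5, density=0.5, format="csr", random_state=0)
y = np.arange(20) % 2

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.svm")
    dump_svmlight_file(X, y, path, zero_based=True)
    size = os.path.getsize(path)

    # Load the file in two byte-range chunks. The first chunk finishes
    # the line that straddles the boundary; the second chunk discards
    # bytes up to the next newline, so every line is read exactly once.
    X1, y1 = load_svmlight_file(path, n_features=5, offset=0, length=size // 2)
    X2, y2 = load_svmlight_file(path, n_features=5, offset=size // 2)

    # Reassembling the chunks recovers the full dataset.
    X_all = sp.vstack([X1, X2])
    y_all = np.concatenate([y1, y2])
    X_ref, y_ref = load_svmlight_file(path, n_features=5)
    assert X_all.shape == X_ref.shape
    assert np.array_equal(y_all, y_ref)
```

Because `zero_based="auto"` falls back to `zero_based=True` whenever `offset` or `length` is given, dumping with `zero_based=True` keeps the chunked and whole-file loads consistent.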
Review comment:

> I can understand that these options are useful for `load_svmlight_file`, but are they for `load_svmlight_files`?

Reply:

> The problem is that `load_svmlight_file` is implemented by calling the `load_svmlight_files` function. Maybe I should just not document the parameters in the `load_svmlight_files` function.

Review comment:

> Just curious, does it actually make sense to have the same offset and length when calling this with multiple files?
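As the last question hints, a single `(offset, length)` pair is most natural when splitting one large file into byte ranges, e.g. for parallel out-of-core loading. A hedged sketch of that pattern with joblib (the `chunk_boundaries` helper is illustrative, not part of scikit-learn):

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from joblib import Parallel, delayed
from sklearn.datasets import dump_svmlight_file, load_svmlight_file


def chunk_boundaries(size, n_chunks):
    """Split [0, size) into n_chunks (offset, length) byte ranges."""
    offsets = [size * i // n_chunks for i in range(n_chunks)]
    lengths = [offsets[i + 1] - offsets[i] for i in range(n_chunks - 1)]
    return list(zip(offsets, lengths + [-1]))  # last chunk reads to EOF


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "big.svm")
    X = sp.random(100, 8, density=0.3, format="csr", random_state=42)
    y = np.zeros(100)
    dump_svmlight_file(X, y, path, zero_based=True)

    # Each worker loads one byte range of the same file; straddling
    # lines are completed by one chunk and skipped by the next, so the
    # chunks partition the samples.
    results = Parallel(n_jobs=2)(
        delayed(load_svmlight_file)(path, n_features=8, offset=o, length=l)
        for o, l in chunk_boundaries(os.path.getsize(path), 4))

    n_rows = sum(Xc.shape[0] for Xc, _ in results)
    assert n_rows == 100  # every line is read exactly once
```

For genuinely different files passed to `load_svmlight_files`, sharing one `(offset, length)` pair is less obviously useful, which is what the reviewer is pointing out.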