[MRG+1] Support for 64 bit sparse array indices in text vectorizers #9147
Conversation
I'd rather leave this until after release. ping then
Sure, will do.
I'm okay with this for what it is. My main concern here is interoperability. Producing a large sparse matrix, but then breaking when it is pushed into a downstream predictor that only supports int32-indexed data is not great. Thanks for your work at #2969 towards this. I'd like to fix the most critical of those before merging this, really.
sklearn/feature_extraction/text.py (Outdated)

```python
            indices_dtype = np.int_
        else:
            raise ValueError(('sparse CSR array has {} non-zero '
                              'elements and require 64 bit indexing, '
```
*requires
Thank you for the review @jnothman! I don't know how you manage to keep track of all the PRs and issues ) Sure, it makes sense to wait until predictors downstream are able to handle 64 bit sparse arrays; I will try to make a few PRs in that direction in the following weeks. Also, Windows CI is currently failing in this PR because `long` is 32 bit on Windows; I need to fix that.
btw can we get windows support in python 3 using a long long array?
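For illustration, a minimal sketch of that idea (my sketch, not code from the PR): on Windows, the `'l'` (C long) typecode is only 32 bit, but the `'q'` (signed long long) typecode available in Python 3 is 64 bit on all platforms and can be viewed as `int64` without a copy:

```python
import array

import numpy as np

# 'q' = signed long long: 64 bit everywhere, unlike 'l' (C long),
# which is only 32 bit on Windows.
indptr = array.array('q', [0, 5, 11])

# Zero-copy view of the buffer as a numpy int64 array.
indptr_np = np.frombuffer(indptr, dtype=np.int64)
print(indptr_np)  # [ 0  5 11]
```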
I know, still, it's a lot of notifications per day...
Was thinking along those lines as well, will try to.
Codecov Report

```diff
@@            Coverage Diff            @@
##           master    #9147     +/-  ##
=========================================
- Coverage    96.2%   96.16%   -0.04%
=========================================
  Files         337      336       -1
  Lines       62817    62422     -395
=========================================
- Hits        60432    60030     -402
- Misses       2385     2392       +7
```

Continue to review full report at Codecov.
Following the earlier discussion, I rewrote this PR somewhat. The current situation:

- I can't come up with a way to write unit tests for the 64 bit case; however, I have updated the tests/benchmarks here (on Linux) that exercise this PR on an actual dataset producing 64 bit indices (and that take several hours to run). Codecov shows a decrease in coverage because there are no tests for the 64 bit case.
- With respect to the interoperability concern above, what currently works downstream is normalization (fixed in #9663), dimensionality reduction (LSI, NMF), and a few linear models (e.g. LogisticRegression with the lbfgs/cd solvers, and ElasticNet). I have not tested clustering yet. Of course, there is still work to do (I added the progress status to the parent comment of #2969), but this might be enough for some reasonable workflows.

A review would be appreciated; all tests now pass.
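As a hedged illustration of such a workflow (my sketch, not from the PR; the corpus and parameters are invented examples using the components listed above as working):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["first document", "second document text", "one more short text"]

# The vectorizer output may now carry 64 bit indices when needed; the
# normalization (norm='l2') and LSI steps below are among those reported
# above to handle 64 bit indexed CSR input.
X = HashingVectorizer(norm='l2').fit_transform(docs)
X_lsi = TruncatedSVD(n_components=2).fit_transform(X)
print(X_lsi.shape)  # (3, 2)
```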
```diff
@@ -784,7 +785,8 @@ def _count_vocab(self, raw_documents, fixed_vocab):
         analyze = self.build_analyzer()
         j_indices = []
-        indptr = _make_int_array()
+        indptr = []
```
I checked (in the above-linked benchmark) that switching from an `array.array` to a `list` doesn't have any negative performance impact here. `j_indices` is already a list, and that's where most of the data is (since `j_indices` is much longer than `indptr`); `indptr` doesn't really matter. This simplifies things with respect to typing (and possible overflows), though. A sketch of the pattern follows.
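A minimal sketch of that pattern (hypothetical variable names, not the PR's actual code): accumulate `indptr` in a plain Python list, whose integers cannot overflow, and choose the numpy dtype only when materializing the array:

```python
import numpy as np

# One entry per document boundary; Python ints never overflow here.
indptr = [0]
for doc_feature_ids in ([1, 5, 2], [0, 3], [7, 7, 2, 4]):
    indptr.append(indptr[-1] + len(doc_feature_ids))

# Pick the smallest sufficient index dtype at conversion time.
dtype = np.int64 if indptr[-1] > np.iinfo(np.int32).max else np.int32
indptr_np = np.asarray(indptr, dtype=dtype)
```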
Let's hurry in a fix for the liblinear segfault (#9545). Otherwise, yes, I'm happy to see this merged.
This LGTM. Any chance @ogrisel could review?
```diff
-        return (indices_a, np.frombuffer(indptr, dtype=np.int32), values[:size])
+        indptr_a = np.frombuffer(indptr, dtype=indices_np_dtype)
+
+        if indptr[-1] > 2147483648:  # = 2**31
```
It would sort of be nice if this were refactored somewhere, but I can't think of somewhere both pleasant and useful to keep it shared.
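One hypothetical shape such a shared helper could take (my sketch only; the name `_smallest_index_dtype` and its location are invented, and the PR deliberately does not add it):

```python
import numpy as np

def _smallest_index_dtype(max_index_value):
    """Return the smallest dtype able to index a CSR matrix whose
    largest index value (typically indptr[-1]) is max_index_value."""
    if max_index_value > np.iinfo(np.int32).max:  # i.e. >= 2**31
        return np.int64
    return np.int32
```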
Code-wise LGTM. Even if this is not linked to this PR, while reviewing I found it strange that the indices are reallocated to an exact size at each iteration there. I would have thought it less costly to double the capacity and return a shrunk array.
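For reference, a minimal sketch of the doubling strategy @glemaitre describes (hypothetical code, not what the vectorizer does): grow the buffer geometrically for amortized O(1) appends, then shrink once at the end:

```python
import numpy as np

def append_doubling(buf, size, value):
    """Append value to buf[:size], doubling the capacity when full."""
    if size == len(buf):
        buf = np.resize(buf, 2 * len(buf))  # reallocate once per doubling
    buf[size] = value
    return buf, size + 1

indices = np.empty(8, dtype=np.int32)
n = 0
for v in range(100):
    indices, n = append_doubling(indices, n, v)
indices = indices[:n]  # shrink to the final size at the end
```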
Thanks for the review @jnothman and @glemaitre! I added the corresponding what's new entry.
Merging, thanks!
Thanks again for the review @jnothman and @glemaitre. @mannby thank you for the initial implementation!
…(scikit-learn#9147) Support new scipy sparse array indices, which can now be > 2^31 (< 2^63). This is needed for very large training sets. Feature indices (based on the number of distinct features) are unlikely to need 4 bytes per value, however.
```python
        else:
            # On Windows with PY2.7 long int would still correspond to 32 bit.
            indices_array_dtype = "l"
            indices_np_dtype = np.int_
```
Can you just use `np.intp` for all cases here?
The issue is that I don't know the `array.array` dtype corresponding to `np.intp`. Both need to match in all cases since we are using `np.frombuffer`.
`np.dtype(np.intp).char` will give you that.
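A small sketch of that suggestion (illustrative only): the `char` attribute yields the `array.array` typecode matching `np.intp` on the current platform, so both sides of the `np.frombuffer` call agree everywhere:

```python
import array

import numpy as np

# e.g. 'l' on 64 bit Linux, 'q' on 64 bit Windows (platform dependent).
typecode = np.dtype(np.intp).char
buf = array.array(typecode, [0, 3, 7])

# Safe zero-copy view: the buffer's item size matches np.intp by construction.
arr = np.frombuffer(buf, dtype=np.intp)
print(arr)  # [0 3 7]
```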
Thanks for this fix @rth et al. It helped me a lot!
Reference Issue

Continuation of #6473 (itself a continuation of #6194). Fixes #6183 (in CountVectorizer); also fixes #8941 (in HashingVectorizer).

Closes #6473
Closes #6194
Closes #6183
Closes #8941
What does this implement/fix? Explain your changes.

This PR switches to 64 bit indexing by default for the `indptr` array indices in `CountVectorizer`, `TfidfVectorizer` and `HashingVectorizer`. It relies on the following assumptions:

- the `indptr` and `indices` attributes have the same dtype in scipy's CSR arrays [1];
- when 64 bit `indptr` and `indices` arrays are given to the `csr_matrix` constructor, it downcasts them to 32 bit if that dtype is sufficient to hold their contents [2] (see the sketch after this section);
- overflow happens in `indptr`, which is typically negligible in size compared to `indices`.

As a result, in this PR:

- in the intermediary array representation, `indptr` is always 64 bit, while `indices` is either without a fixed dtype (in `CountVectorizer`, etc.) or 32 bit (in `HashingVectorizer`, since the hash size is 32 bit anyway);
- the returned arrays use 32 bit indexing for both `indptr` and `indices` if that dtype is sufficient, and 64 bit indexing otherwise (where supported, i.e. with scipy > 0.14), which makes this PR fully backward compatible.

[1]: #6194 (comment)
[2]: #6473 (comment)
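To illustrate the downcasting assumption [2], here is a minimal sketch (mine, not part of the PR) of what the `csr_matrix` constructor does with 64 bit index arrays whose contents fit in 32 bit:

```python
import numpy as np
import scipy.sparse as sp

# A tiny CSR matrix built with explicitly 64 bit index arrays.
data = np.ones(3)
indices = np.array([0, 2, 1], dtype=np.int64)
indptr = np.array([0, 2, 3], dtype=np.int64)

X = sp.csr_matrix((data, indices, indptr), shape=(2, 3))

# scipy downcasts both index arrays to 32 bit because their contents fit;
# 64 bit dtypes are kept only when actually needed.
print(X.indices.dtype, X.indptr.dtype)  # int32 int32 (scipy >= 0.14)
```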
Any other comments?

All the benchmark scripts and results can be found at https://github.com/rth/notebooks/tree/master/sklearn/PR_6473.

- This PR was benchmarked on the 20 newsgroups dataset with `bench_text_vectorizers.py` from #9086: the results before this PR (`bench_master.log`) and after this PR (`bench_PR.log`) show no degradation in performance or memory usage.
- The overflow from the original issues was reproduced with the `run.sh` script; on `master` the output is in `res_master.log`. After this PR, indices in all text vectorizers no longer overflow (`res_PR.log`) and use 64 bit indexing when necessary. At least 90 GB of RAM is required by `run.sh` (tested on an EC2 `m4.10xlarge` with 160 GB of RAM).
- ~~`norm='l2'` will still fail due to Cython code (related to #2969)~~ (fixed in #9663)
- These changes were reviewed in #6194 by @lesteve and @ogrisel; this PR extends them to `HashingVectorizer` and accounts for the latest modifications in the vectorizers.
- ~~64 bit indexing won't work on Windows~~

@psinger Would you be able to confirm that this fixes the overflow in CountVectorizer indexing? Thanks.

cc @jnothman