[MRG+1] Support for 64 bit sparse array indices in text vectorizers #9147
Conversation
I'd rather leave this until after release. ping then
Sure, will do.
I'm okay with this for what it is. My main concern here is interoperability. Producing a large sparse matrix, but then breaking when it is pushed into a downstream predictor that only supports int32-indexed data is not great. Thanks for your work at #2969 towards this. I'd like to fix the most critical of those before merging this, really.
sklearn/feature_extraction/text.py (Outdated)

```python
            indices_dtype = np.int_
        else:
            raise ValueError(('sparse CSR array has {} non-zero '
                              'elements and require 64 bit indexing, '
```
*requires
Thank you for the review @jnothman! I don't know how you manage to keep track of all the PRs and issues ) Sure, it makes sense to wait until predictors downstream are able to handle 64 bit sparse arrays; I will try to make a few PRs in that direction in the following weeks. Also, Windows CI is currently failing in this PR because `long` is 32 bit on Windows; I need to fix that.
btw can we get windows support in python 3 using a long long array?
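For illustration, a minimal sketch of that idea (my sketch, not code from the PR): on Windows, the `'l'` (C long) typecode is only 32 bit, but the `'q'` (signed long long) typecode available in Python 3 is 64 bit on all platforms and can be viewed as `int64` without a copy:

```python
import array

import numpy as np

# 'q' = signed long long: 64 bit everywhere, unlike 'l' (C long),
# which is only 32 bit on Windows.
indptr = array.array('q', [0, 5, 11])

# Zero-copy view of the buffer as a numpy int64 array.
indptr_np = np.frombuffer(indptr, dtype=np.int64)
print(indptr_np)  # [ 0  5 11]
```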
I know, still, it's a lot of notifications per day...
Was thinking along those lines as well, will try to.
Codecov Report

```diff
@@            Coverage Diff            @@
##           master    #9147     +/-  ##
=========================================
- Coverage    96.2%   96.16%   -0.04%
=========================================
  Files         337      336       -1
  Lines       62817    62422     -395
=========================================
- Hits        60432    60030     -402
- Misses       2385     2392       +7
```

Continue to review full report at Codecov.
Following the earlier discussion, I rewrote this PR somewhat. The current situation:

- I can't come up with a way to write unit tests for the 64 bit case; however, I have updated the tests/benchmarks here (on Linux) that exercise this PR on an actual dataset producing 64 bit indices (and that take several hours to run). Codecov shows a decrease in coverage because there are no tests for the 64 bit case.
- With respect to the interoperability concern above, what currently works downstream is normalization (fixed in #9663), dimensionality reduction (LSI, NMF), and a few linear models (e.g. LogisticRegression with the lbfgs/cd solvers, and ElasticNet). I have not tested clustering yet. Of course, there is still work to do (I added the progress status to the parent comment of #2969), but this might be enough for some reasonable workflows.

A review would be appreciated; all tests now pass.
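As a hedged illustration of such a workflow (my sketch, not from the PR; the corpus and parameters are invented examples using the components listed above as working):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["first document", "second document text", "one more short text"]

# The vectorizer output may now carry 64 bit indices when needed; the
# normalization (norm='l2') and LSI steps below are among those reported
# above to handle 64 bit indexed CSR input.
X = HashingVectorizer(norm='l2').fit_transform(docs)
X_lsi = TruncatedSVD(n_components=2).fit_transform(X)
print(X_lsi.shape)  # (3, 2)
```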
```diff
@@ -784,7 +785,8 @@ def _count_vocab(self, raw_documents, fixed_vocab):
         analyze = self.build_analyzer()
         j_indices = []
-        indptr = _make_int_array()
+        indptr = []
```
I checked (in the above-linked benchmark) that switching from an `array.array` to a `list` doesn't have any negative performance impact here. `j_indices` is already a list, and that's where most of the data is (since `j_indices` is much longer than `indptr`); `indptr` doesn't really matter. This simplifies things with respect to typing (and possible overflows), though. A sketch of the pattern follows.
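A minimal sketch of that pattern (hypothetical variable names, not the PR's actual code): accumulate `indptr` in a plain Python list, whose integers cannot overflow, and choose the numpy dtype only when materializing the array:

```python
import numpy as np

# One entry per document boundary; Python ints never overflow here.
indptr = [0]
for doc_feature_ids in ([1, 5, 2], [0, 3], [7, 7, 2, 4]):
    indptr.append(indptr[-1] + len(doc_feature_ids))

# Pick the smallest sufficient index dtype at conversion time.
dtype = np.int64 if indptr[-1] > np.iinfo(np.int32).max else np.int32
indptr_np = np.asarray(indptr, dtype=dtype)
```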
Let's hurry in a fix for the liblinear segfault (#9545). Otherwise, yes, I'm happy to see this merged.
This LGTM. Any chance @ogrisel could review?
```diff
-        return (indices_a, np.frombuffer(indptr, dtype=np.int32), values[:size])
+        indptr_a = np.frombuffer(indptr, dtype=indices_np_dtype)
+
+        if indptr[-1] > 2147483648:  # = 2**31
```
It would sort of be nice if this were refactored somewhere, but I can't think of somewhere both pleasant and useful to keep it shared.
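One hypothetical shape such a shared helper could take (my sketch only; the name `_smallest_index_dtype` and its location are invented, and the PR deliberately does not add it):

```python
import numpy as np

def _smallest_index_dtype(max_index_value):
    """Return the smallest dtype able to index a CSR matrix whose
    largest index value (typically indptr[-1]) is max_index_value."""
    if max_index_value > np.iinfo(np.int32).max:  # i.e. >= 2**31
        return np.int64
    return np.int32
```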
Code-wise LGTM. Even if this is not linked to this PR, while reviewing I found it strange that the indices are reallocated to an exact size at each iteration there. I would have thought it less costly to double the capacity and return a shrunk array.
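For reference, a minimal sketch of the doubling strategy @glemaitre describes (hypothetical code, not what the vectorizer does): grow the buffer geometrically for amortized O(1) appends, then shrink once at the end:

```python
import numpy as np

def append_doubling(buf, size, value):
    """Append value to buf[:size], doubling the capacity when full."""
    if size == len(buf):
        buf = np.resize(buf, 2 * len(buf))  # reallocate once per doubling
    buf[size] = value
    return buf, size + 1

indices = np.empty(8, dtype=np.int32)
n = 0
for v in range(100):
    indices, n = append_doubling(indices, n, v)
indices = indices[:n]  # shrink to the final size at the end
```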
Thanks for the review @jnothman and @glemaitre! I added the corresponding what's new entry.
Merging, thanks!
Thanks again for the review @jnothman and @glemaitre. @mannby thank you for the initial implementation!
…(scikit-learn#9147) Support new scipy sparse array indices, which can now be > 2^31 (< 2^63). This is needed for very large training sets. Feature indices (based on the number of distinct features) are unlikely to need 4 bytes per value, however.
```python
        else:
            # On Windows with PY2.7 long int would still correspond to 32 bit.
            indices_array_dtype = "l"
            indices_np_dtype = np.int_
```
Can you just use `np.intp` for all cases here?
The issue is that I don't know the `array.array` dtype corresponding to `np.intp`. Both need to match in all cases since we are using `np.frombuffer`.
`np.dtype(np.intp).char` will give you that.
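A small sketch of that suggestion (illustrative only): the `char` attribute yields the `array.array` typecode matching `np.intp` on the current platform, so both sides of the `np.frombuffer` call agree everywhere:

```python
import array

import numpy as np

# e.g. 'l' on 64 bit Linux, 'q' on 64 bit Windows (platform dependent).
typecode = np.dtype(np.intp).char
buf = array.array(typecode, [0, 3, 7])

# Safe zero-copy view: the buffer's item size matches np.intp by construction.
arr = np.frombuffer(buf, dtype=np.intp)
print(arr)  # [0 3 7]
```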
Thanks for this fix @rth et al. It helped me a lot!
Reference Issue

Continuation of #6473 (itself a continuation of #6194). Fixes #6183 (in CountVectorizer); also fixes #8941 (in HashingVectorizer).

Closes #6473
Closes #6194
Closes #6183
Closes #8941
What does this implement/fix? Explain your changes.

This PR switches to 64 bit indexing by default for the `indptr` array indices in `CountVectorizer`, `TfidfVectorizer` and `HashingVectorizer`. It relies on the following assumptions:

- the `indptr` and `indices` attributes have the same dtype in scipy's CSR arrays [1];
- when 64 bit `indptr` and `indices` arrays are given to the `csr_matrix` constructor, it downcasts them to 32 bit if that dtype is sufficient to hold their contents [2] (see the sketch after this section);
- overflow happens in `indptr`, which is typically negligible in size compared to `indices`.

As a result, in this PR:

- in the intermediary array representation, `indptr` is always 64 bit, while `indices` is either without a fixed dtype (in `CountVectorizer`, etc.) or 32 bit (in `HashingVectorizer`, since the hash size is 32 bit anyway);
- the returned arrays use 32 bit indexing for both `indptr` and `indices` if that dtype is sufficient, and 64 bit indexing otherwise (where supported, i.e. with scipy > 0.14), which makes this PR fully backward compatible.

[1]: #6194 (comment)
[2]: #6473 (comment)
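To illustrate the downcasting assumption [2], here is a minimal sketch (mine, not part of the PR) of what the `csr_matrix` constructor does with 64 bit index arrays whose contents fit in 32 bit:

```python
import numpy as np
import scipy.sparse as sp

# A tiny CSR matrix built with explicitly 64 bit index arrays.
data = np.ones(3)
indices = np.array([0, 2, 1], dtype=np.int64)
indptr = np.array([0, 2, 3], dtype=np.int64)

X = sp.csr_matrix((data, indices, indptr), shape=(2, 3))

# scipy downcasts both index arrays to 32 bit because their contents fit;
# 64 bit dtypes are kept only when actually needed.
print(X.indices.dtype, X.indptr.dtype)  # int32 int32 (scipy >= 0.14)
```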
Any other comments?

All the benchmark scripts and results can be found at https://github.com/rth/notebooks/tree/master/sklearn/PR_6473.

- This PR was benchmarked on the 20 newsgroups dataset with `bench_text_vectorizers.py` from #9086: the results before this PR (`bench_master.log`) and after this PR (`bench_PR.log`) show no degradation in performance or memory usage.
- The overflow from the original issues was reproduced with the `run.sh` script; on `master` the output is in `res_master.log`. After this PR, indices in all text vectorizers no longer overflow (`res_PR.log`) and use 64 bit indexing when necessary. At least 90 GB of RAM is required by `run.sh` (tested on an EC2 `m4.10xlarge` with 160 GB of RAM).
- ~~`norm='l2'` will still fail due to Cython code (related to #2969)~~ (fixed in #9663)
- These changes were reviewed in #6194 by @lesteve and @ogrisel; this PR extends them to `HashingVectorizer` and accounts for the latest modifications in the vectorizers.
- ~~64 bit indexing won't work on Windows~~

@psinger Would you be able to confirm that this fixes the overflow in CountVectorizer indexing? Thanks.

cc @jnothman