
[MRG+1] Support for 64 bit sparse array indices in text vectorizers #9147


Merged: 7 commits into scikit-learn:master on Nov 29, 2017

Conversation

@rth (Member) commented Jun 17, 2017

Reference Issue

Continuation of #6473 (itself a continuation of #6194). Fixes #6183 (in CountVectorizer); also fixes #8941 (in HashingVectorizer).

Closes #6473
Closes #6194
Closes #6183
Closes #8941

What does this implement/fix? Explain your changes.

This PR switches to 64 bit indexing by default for the indptr array indices in CountVectorizer, TfidfVectorizer and HashingVectorizer. It relies on the following assumptions:

  • indptr and indices attributes have the same dtype in SciPy's CSR arrays [1]
  • when 64 bit indptr and indices are given to the csr_matrix constructor, it will downcast them to 32 bit if that is sufficient to hold their contents [2] (illustrated in the sketch below)
  • overflow happens in indptr, which is typically negligible in size compared to indices

As a result, in this PR,

  • in the intermediate array representation, indptr is always 64 bit, while indices is either left without an explicit dtype (in CountVectorizer, etc.) or 32 bit (in HashingVectorizer, as the hash size is 32 bit anyway).
  • the returned arrays use 32 bit indexing for both indptr and indices if that dtype is sufficient, and 64 bit indexing otherwise (supported with scipy > 0.14), which makes this PR fully backward compatible.
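To illustrate assumption [2] above, here is a minimal sketch (not scikit-learn code) of the downcasting behaviour of the csr_matrix constructor:

    import numpy as np
    import scipy.sparse as sp

    data = np.array([1.0, 2.0, 3.0])
    indices = np.array([0, 2, 1], dtype=np.int64)
    indptr = np.array([0, 2, 3], dtype=np.int64)

    # With scipy >= 0.14, the constructor downcasts both index arrays to
    # int32 because their contents fit; indices and indptr share a dtype.
    X = sp.csr_matrix((data, indices, indptr), shape=(2, 3))
    print(X.indices.dtype, X.indptr.dtype)  # typically int32 int32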

Any other comments?

All the benchmark scripts and results can be found here.

@psinger Would you be able to confirm that this fixes the overflow in CountVectorizer indexing? Thanks.

cc @jnothman

@jnothman (Member) commented Jun 17, 2017 via email

@rth rth changed the title Support for 64 bit sparse array indices in text vectorizers [MRG] Support for 64 bit sparse array indices in text vectorizers Jun 18, 2017
@rth (Member, Author) commented Jun 18, 2017

Sure, will do.

@jnothman (Member) left a review comment:

I'm okay with this for what it is. My main concern here is interoperability. Producing a large sparse matrix, but then breaking when it is pushed into a downstream predictor that only supports int32-indexed data is not great. Thanks for your work at #2969 towards this. I'd like to fix the most critical of those before merging this, really.

            indices_dtype = np.int_
        else:
            raise ValueError(('sparse CSR array has {} non-zero '
                              'elements and require 64 bit indexing, '
Review comment (Member):
*requires

@rth (Member, Author) commented Aug 14, 2017

Thank you for the review @jnothman! I don't know how you manage to keep track of all the PRs and issues :) Sure, it makes sense to wait until downstream predictors are able to handle 64 bit sparse arrays. I will try to make a few PRs in that direction in the following weeks.

Also, the Windows CI is currently failing in this PR because long is 32 bit on Windows; I need to fix that...
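For context (a one-line check, not part of the PR): the 'l' typecode of array.array maps to a C long, which is 8 bytes on typical 64 bit Linux/macOS builds but only 4 bytes on Windows, hence the overflow on the Windows CI:

    from array import array

    # 4 on Windows (even 64 bit builds), 8 on typical 64 bit Linux/macOS.
    print(array('l').itemsize)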

@jnothman (Member) commented Aug 14, 2017

I don't know how you manage to keep track of all the PRs and issues

Uhh....
[screenshot: 2017-08-14, 6:05 PM]

I promise as much gets lost as gets tracked...

@jnothman (Member) commented Aug 14, 2017 via email

@rth (Member, Author) commented Aug 14, 2017

I don't know how you manage to keep track of all the PRs and issues

[screenshot: 2017-08-14, 6:05 PM]

I know, still, it's a lot of notifications per day...

btw can we get windows support in python 3 using a long long array?

Was thinking along those lines as well, will try to.

@codecov (bot) commented Oct 1, 2017

Codecov Report

Merging #9147 into master will decrease coverage by 0.03%.
The diff coverage is 78.57%.


@@            Coverage Diff             @@
##           master    #9147      +/-   ##
==========================================
- Coverage    96.2%   96.16%   -0.04%     
==========================================
  Files         337      336       -1     
  Lines       62817    62422     -395     
==========================================
- Hits        60432    60030     -402     
- Misses       2385     2392       +7
Impacted Files Coverage Δ
sklearn/feature_extraction/text.py 95.96% <78.57%> (-0.64%) ⬇️
sklearn/linear_model/tests/test_bayes.py 85.48% <0%> (-4.06%) ⬇️
sklearn/utils/testing.py 75.94% <0%> (-1.06%) ⬇️
sklearn/utils/deprecation.py 95.83% <0%> (-1.05%) ⬇️
sklearn/decomposition/pca.py 94.55% <0%> (-0.64%) ⬇️
sklearn/preprocessing/_function_transformer.py 97.29% <0%> (-0.58%) ⬇️
sklearn/metrics/ranking.py 98.31% <0%> (-0.5%) ⬇️
sklearn/gaussian_process/correlation_models.py 64.1% <0%> (-0.46%) ⬇️
sklearn/ensemble/gradient_boosting.py 95.76% <0%> (-0.45%) ⬇️
sklearn/datasets/samples_generator.py 93.36% <0%> (-0.32%) ⬇️
... and 52 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rth rth force-pushed the large-number-of-features branch 6 times, most recently from 819a5b7 to 9d732b6 Compare October 1, 2017 10:35
@rth rth force-pushed the large-number-of-features branch from 9d732b6 to 121a95d Compare November 11, 2017 22:41
Claes-Fredrik Mannby and others added 4 commits November 22, 2017 09:38
Support new scipy sparse array indices, which can now be > 2^31 (< 2^63).

This is needed for very large training sets.
Feature indices (based on the number of distinct features), are unlikely to need 4 bytes per value, however.
@rth rth force-pushed the large-number-of-features branch 3 times, most recently from 81322f2 to 8d302e4 Compare November 22, 2017 15:36
@rth rth force-pushed the large-number-of-features branch from 8d302e4 to f5b2a12 Compare November 22, 2017 15:38
@rth rth force-pushed the large-number-of-features branch from f5b2a12 to f6a7d0d Compare November 22, 2017 15:39
@rth (Member, Author) commented Nov 22, 2017

Following earlier discussion, I rewrote this PR somewhat.

The current situation is as follows:

  • internally
    (a) for Python > 3.3, longlong (i.e. 64 bit) indices are used for indptr on all platforms;
    (b) otherwise, long indices are used for indptr (i.e. 64 bit on Unix-like systems and 32 bit on Windows).
    The indices remain 32 bit in any case (i.e. the vocabulary size must be below ~2e9), which avoids any memory overhead. A minimal sketch of this selection follows the list below.

  • externally (i.e. from the users' perspective)
    (a) on Linux (any Python version) or Windows (with Python > 3.3), the resulting sparse arrays will have 32 bit or 64 bit indices as necessary. 64 bit sparse array indices only work with scipy >= 0.14; an error is raised otherwise.
    (b) on Windows with Python 2, only 32 bit indices are supported. An overflow error message similar to the one observed in the parent issues would be raised.
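As a rough sketch of the internal typecode selection described in the first bullet (illustrative only; the exact names and the version check in the PR may differ):

    import sys
    import numpy as np

    if sys.version_info >= (3, 3):
        # array.array gained the 'q' (C long long, 64 bit) typecode in 3.3,
        # so indptr can be accumulated with 64 bit entries on every platform.
        indices_array_dtype = "q"
        indices_np_dtype = np.longlong
    else:
        # On Windows with Python 2.7 a C long is still 32 bit.
        indices_array_dtype = "l"
        indices_np_dtype = np.int_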

I can't come up with a way to write unit tests for the 64 bit case; however, I have updated the tests/benchmarks here (on Linux) that exercise this PR on an actual dataset producing 64 bit indices (and take several hours to run). Codecov shows a decrease in coverage because there are no tests for the 64 bit case.

My main concern here is interoperability. Producing a large sparse matrix, but then breaking when it is pushed into a downstream predictor that only supports int32-indexed data is not great. Thanks for your work at #2969 towards this. I'd like to fix the most critical of those before merging this, really.

With respect to this point, what currently works is normalization (fixed in #9663), dimensionality reduction (LSI, NMF), and a few linear models (e.g. LogisticRegression with lbfgs/cd solver & ElasticNet). I have not tested clustering yet. Of course, there is still work to do (I added the progress status to the parent comment of #2969), but this might be enough to have some reasonable workflow.

A review would be appreciated, all tests now pass.

@@ -784,7 +785,8 @@ def _count_vocab(self, raw_documents, fixed_vocab):

         analyze = self.build_analyzer()
         j_indices = []
-        indptr = _make_int_array()
+        indptr = []
@rth (Member, Author) commented Nov 22, 2017:

I checked (in the above-linked benchmark) that switching from an array.array to a list doesn't have any negative performance impact here. j_indices is already a list and that's where most of the data is (since j_indices is much longer than indptr); indptr doesn't really matter. This does, however, simplify things with respect to typing (and possible overflows).
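A minimal sketch of the pattern being described, assuming hypothetical helper names (not the actual _count_vocab code): indptr is accumulated as a plain Python list, so its entries cannot overflow, and an index dtype is only chosen when the CSR matrix is built:

    import numpy as np
    import scipy.sparse as sp

    def build_csr(j_indices, indptr, values, n_features):
        # Choose the smallest sufficient dtype for indptr at conversion time.
        indptr_dtype = (np.int64 if indptr[-1] > np.iinfo(np.int32).max
                        else np.int32)
        indices_a = np.asarray(j_indices, dtype=np.int32)  # vocabulary < 2**31
        indptr_a = np.asarray(indptr, dtype=indptr_dtype)
        data = np.asarray(values)
        return sp.csr_matrix((data, indices_a, indptr_a),
                             shape=(len(indptr) - 1, n_features))

On scipy < 0.14 the 64 bit branch would fail, which is why the PR raises an explicit error in that case.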

@jnothman (Member) commented:

My main concern here is interoperability. Producing a large sparse matrix, but then breaking when it is pushed into a downstream predictor that only supports int32-indexed data is not great.

Let's hurry in a fix for the liblinear segfault (#9545). Otherwise, yes, I'm happy to see this merged.

@jnothman (Member) left a review comment:

This LGTM. Any chance @ogrisel could review?

-    return (indices_a, np.frombuffer(indptr, dtype=np.int32), values[:size])
+    indptr_a = np.frombuffer(indptr, dtype=indices_np_dtype)
+
+    if indptr[-1] > 2147483648:  # = 2**31
Review comment (Member):

It would sort of be nice if this were refactored somewhere, but I can't think of somewhere both pleasant and useful to keep it shared.
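For illustration, such a shared helper might look like the following (hypothetical; not code from this PR):

    import numpy as np

    def _select_index_dtype(max_value):
        # Smallest index dtype able to hold max_value; callers would still
        # need to check that their scipy version supports 64 bit indices.
        if max_value <= np.iinfo(np.int32).max:
            return np.int32
        return np.int64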

@jnothman jnothman changed the title [MRG] Support for 64 bit sparse array indices in text vectorizers [MRG+1] Support for 64 bit sparse array indices in text vectorizers Nov 26, 2017
@glemaitre (Member) commented:

Code-wise LGTM.

Even if this is not directly linked to this PR: while reviewing, I found it strange that the indices array is reallocated to the exact size at each iteration there. I would have thought it would be less costly to double the capacity and return a shrunk array (see the sketch below).
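A minimal sketch of the doubling strategy @glemaitre alludes to (illustrative only, not the code in scikit-learn):

    import numpy as np

    def _grow(buf, needed):
        # Geometric growth amortises reallocation cost over many appends,
        # at the price of temporarily holding up to ~2x the final size.
        if needed <= buf.shape[0]:
            return buf
        new = np.empty(max(needed, 2 * buf.shape[0]), dtype=buf.dtype)
        new[:buf.shape[0]] = buf
        return new

    # The caller tracks how many entries are valid and returns a trimmed
    # copy at the end, e.g. indices = indices[:size].copy().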

@rth (Member, Author) commented Nov 29, 2017

Thanks for the review @jnothman and @glemaitre!

I added the corresponding what's new entry.

@jnothman (Member) left a review comment:

Merging, thanks!

@jnothman jnothman merged commit e5a5e77 into scikit-learn:master Nov 29, 2017
@rth (Member, Author) commented Nov 29, 2017

Thanks again for the review @jnothman and @glemaitre.

@mannby thank you for the initial implementation!

@rth rth deleted the large-number-of-features branch November 29, 2017 22:45
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
Support for 64 bit sparse array indices in text vectorizers (scikit-learn#9147)

Support new scipy sparse array indices, which can now be > 2^31 (< 2^63).
This is needed for very large training sets.
Feature indices (based on the number of distinct features), are unlikely to need 4 bytes per value, however.
    else:
        # On Windows with PY2.7 long int would still correspond to 32 bit.
        indices_array_dtype = "l"
        indices_np_dtype = np.int_

A reviewer commented:

Can you just use np.intp for all cases here?

@rth (Member, Author) replied:

The issue is that I don't know the array.array dtype corresponding to np.intp. Both need to match in all cases since we are using np.frombuffer.

The reviewer replied:

np.dtype(np.intp).char will give you that.
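For example (a quick check, not code from the PR), this keeps the array.array typecode and the numpy dtype used by np.frombuffer consistent across platforms:

    import numpy as np
    from array import array

    typecode = np.dtype(np.intp).char  # e.g. 'l' or 'q' depending on platform
    buf = array(typecode, [0, 3, 7])
    print(np.frombuffer(buf, dtype=np.intp))  # [0 3 7]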

@robguinness commented:
Thanks for this fix @rth et al. It helped me a lot!
