
[MRG] Support new scipy sparse array indices, which can now be > 2^31 (Was pull request #6194) #6473


Closed
wants to merge 3 commits

Conversation

mannby

@mannby mannby commented Mar 1, 2016

Enable training with very large numbers of samples and features.

I moved this pull request from my master branch to a side branch, to free up master in my repository.

Please see previous discussion in pull request #6194.



Claes-Fredrik Mannby added 3 commits January 19, 2016 13:08
…63).

This is needed for very large training sets.
Feature indices (based on the number of distinct features) are unlikely to need 4 bytes per value, however.
@psinger

psinger commented May 4, 2016

Would be great if this could get merged soon!

@psinger psinger left a comment

this works and should really get merged soon

@jnothman
Member

Thanks for the ping, @psinger. This needs a rebase.

@rth
Member

rth commented Jun 6, 2017

Thanks for the PR @mannby! Wouldn't this make CountVectorizer use almost twice as much memory by default for scipy > 0.14?

@jnothman
Member

Do you think doubling the memory is a major risk, @rth? At least it's linear without an enormous constant... @ogrisel was +1 for that approach.

However, I have my doubts that this is often necessary after #7272, and it is of limited use while a number of our downstream estimators have Cython code relying on 32-bit sparse indices, so I am consigning this issue to a later release.

@jnothman jnothman modified the milestones: 0.20, 0.19 Jun 14, 2017
@rth
Member

rth commented Jun 15, 2017

Do you think doubling the memory is a major risk, @rth? At least it's linear without an enormous constant... @ogrisel was +1 for that approach.

I might be a bit biased on this, but to hit this error one needs to process [text] data that would produce sparse arrays larger than (2**31-1)*(4 * 2+8)*1e-9 = 34 GB. I imagine that for users who work with such large datasets this is definitely something that needs fixing, but for those who work with slightly smaller datasets it would be an additional 50% memory overhead without gaining anything (even if it scales linearly). For instance, the difference between 15 GB and 23.5 GB of RAM is significant, and it might break the processing pipeline of some users.
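
For reference, the figure can be reproduced in two lines (the per-element byte count below is just the (4 * 2 + 8) factor from the estimate above):

nnz = 2**31 - 1                    # maximum number of nonzeros with 32-bit indices
print(nnz * (4 * 2 + 8) * 1e-9)    # -> ~34.4 (GB), matching the estimate above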

Because one doesn't know the size of the dataset in advance, I imagine it's not possible to automatically switch the dtype as is done in scipy,

  • Sparse matrices are no longer limited to 2^31 nonzero elements. They
    automatically switch to using 64-bit index data type for matrices containing
    more elements. [...]

but maybe it would be possible to add an indices_dtype=np.int32 parameter (analogous to the dtype parameter of the vectorizers) that would keep the current behavior by default and let people who need it switch to 64 bit? Though it would add yet another parameter to the vectorizers.. what do you think?
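
As a purely hypothetical illustration of that proposal (indices_dtype is not an existing vectorizer parameter), the CSR assembly step inside a vectorizer could honor such a parameter roughly like this:

import numpy as np
import scipy.sparse as sp

# Hypothetical sketch only; not an existing scikit-learn API.
def assemble_csr(values, j_indices, indptr, shape, indices_dtype=np.int32):
    indices = np.asarray(j_indices, dtype=indices_dtype)
    # indptr holds cumulative nonzero counts, so it is the array most likely
    # to exceed 2**31 - 1 and benefit from 64-bit entries
    indptr = np.asarray(indptr, dtype=indices_dtype)
    # note that scipy may still downcast to 32 bit when that is sufficient
    return sp.csr_matrix((values, indices, indptr), shape=shape)

# users who hit the overflow could then opt in explicitly:
X = assemble_csr([1, 2, 3], [0, 2, 1], [0, 2, 3], shape=(2, 3),
                 indices_dtype=np.int64)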

@jnothman
Member

I think we need @mannby to tell us if the recent changes to CountVectorizer fix their problem without changing the index dtype. I'm also inclined to say that anyone with a dataset that big should find a way to not vectorize it all at once (but I know we don't make it particularly easy to merge vocabularies from multiple vectorizer instances).
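
For what it's worth, merging vocabularies by hand is doable, if clunky; a rough sketch (the helper below is only an illustration, not something scikit-learn ships):

import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

def vectorize_in_chunks(doc_chunks):
    merged_vocab = {}
    parts = []
    for docs in doc_chunks:
        vec = CountVectorizer()
        X = vec.fit_transform(docs).tocoo()
        # remap this chunk's column indices into the shared vocabulary
        col_map = {idx: merged_vocab.setdefault(term, len(merged_vocab))
                   for term, idx in vec.vocabulary_.items()}
        cols = [col_map[c] for c in X.col]
        parts.append((X.data, X.row, cols, X.shape[0]))
    n_features = len(merged_vocab)
    mats = [sp.csr_matrix((data, (rows, cols)), shape=(n_rows, n_features))
            for data, rows, cols, n_rows in parts]
    return sp.vstack(mats, format="csr"), merged_vocab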

@psinger

psinger commented Jun 15, 2017

I think I have encountered this problem in various projects over the last few years. Working with text data that exceeds the limit here is something that is quite easily achieved. I would therefore vouch for a fix.

Personally, having a separate parameter for the dtype sounds fine.

@jnothman
Member

jnothman commented Jun 15, 2017 via email

@rth
Member

rth commented Jun 16, 2017

However, I have my doubts that this is often necessary after #7272,

I rebased and re-ran the tests on this, and this fix is still definitely necessary. The dataset size needed to run into this issue would be 2.1 billion words, which is roughly the size of the English Wikipedia; I agree it is fairly easy to reach.

Also, #8941 shows the same issue but for HashingVectorizer. My previous memory overhead estimate was too high (the size of the indptr array is typically negligible compared with the indices), and it also looks like csr_matrix will downcast the index arrays to 32 bit when that is sufficient for indexing:

In [7]: import numpy as np
   ...: from scipy.sparse import csr_matrix
   ...: data = np.array([1, 2, 3, 4, 5, 6])
   ...: indptr = np.array([0, 2, 3, 6], dtype='int64')
   ...: indices = np.array([0, 2, 2, 0, 1, 2], dtype='int64')
   ...: indices.dtype
   ...: 
Out[7]: dtype('int64')

In [8]: csr_matrix((data, indices, indptr)).indices.dtype
Out[8]: dtype('int32')

and it is of limited use while a number of our downstream estimators have Cython code relying on 32-bit sparse indices

Yes, even when this is fixed, one cannot, for example, "l2"-normalize the resulting array, because sklearn.utils.sparsefuncs_fast.inplace_csr_row_normalize_l2 doesn't support 64-bit indices.
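
Until that is fixed, a plain scipy/numpy row normalization that only touches .data works regardless of the index dtype; a minimal sketch:

import numpy as np
import scipy.sparse as sp

def l2_normalize_rows(X):
    # works for CSR matrices with either 32- or 64-bit indices/indptr,
    # since only the .data array is modified
    X = sp.csr_matrix(X).astype(np.float64)   # copy with float data
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0                   # leave all-zero rows untouched
    X.data /= np.repeat(norms, np.diff(X.indptr))  # divide each row by its norm
    return X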

#2969 needs fixing first; working on that..

but was that needed for the end matrix, or for the intermediate
representation which is now (very recently) much smaller

Both, I believe, even in the latest version.. (Will respond in more detail about this later)

@psinger

psinger commented Jun 16, 2017

@rth Thanks for confirming this. Funnily, I ran into this problem when I was working with the complete English Wikipedia ;)

@jnothman
Member

jnothman commented Jun 17, 2017 via email

@rth
Member

rth commented Jun 17, 2017

And why would you process the entirety of Wikipedia with a single
CountVectorizer (except for a lack of preexisting CountVectorizer merging
tools)?

Because one can easily do that with gensim, and one might try to run that example with scikit-learn's CountVectorizer or TfidfVectorizer and expect it to work.. I agree that chunking (optionally parallelizing) and merging the vocabularies / results at the end would be better and more efficient, but that's another level of complexity compared to just running the default vectorizers..

@psinger

psinger commented Jun 17, 2017

And why would you process the entirety of Wikipedia with a single
CountVectorizer (except for a lack of preexisting CountVectorizer merging
tools)?

If you have the memory for it, why not? It makes the whole process much easier instead of chunking etc.

@rth
Member

rth commented Jun 17, 2017

Rebased and continued this PR in #9147 ..
