
[MRG] Support new scipy sparse array indices, which can now be > 2^31 (Was pull request #6194) #6473


Closed
wants to merge 3 commits

Conversation

mannby

@mannby mannby commented Mar 1, 2016

Enable training with very large numbers of samples and features.

I moved this pull request from my master branch to a side branch, to free up master in my repository.

Please see previous discussion in pull request #6194.



Claes-Fredrik Mannby added 3 commits January 19, 2016 13:08
…63).

This is needed for very large training sets.
Feature indices (based on the number of distinct features) are unlikely to need 4 bytes per value, however.
@psinger

psinger commented May 4, 2016

Would be great if this could get merged soon!

@psinger psinger left a comment

this works and should really get merged soon

@jnothman
Member

Thanks for the ping, @psinger. This needs a rebase.

@rth
Member

rth commented Jun 6, 2017

Thanks for the PR @mannby! Wouldn't this make CountVectorizer use almost twice as much memory by default for scipy > 0.14?

@jnothman
Member

Do you think doubling the memory is a major risk, @rth? At least it's linear without an enormous constant... @ogrisel was +1 for that approach.

However, I have my doubts that this is often necessary after #7272, and it is of limited use while a number of our downstream estimators have Cython code relying on 32-bit sparse indices, so I am consigning this issue to a later release.

@jnothman jnothman modified the milestones: 0.20, 0.19 Jun 14, 2017
@rth
Member

rth commented Jun 15, 2017

Do you think doubling the memory is a major risk, @rth? At least it's linear without an enormous constant... @ogrisel was +1 for that approach.

I might be a bit biased on this, but to hit this error one needs to process [text] data that would produce sparse arrays larger than (2**31-1)*(4 * 2+8)*1e-9 = 34 GB. I imagine that for users who work with such large datasets this is definitely something that needs fixing, but for those who work with slightly smaller datasets it would be an additional 50% memory overhead without gaining anything (even if it scales linearly). For instance, the difference between 15 GB and 23.5 GB of RAM is significant, and it might break the processing pipeline of some users.
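
For reference, the figure can be reproduced in two lines (the per-element byte count below is just the (4 * 2 + 8) factor from the estimate above):

nnz = 2**31 - 1                    # maximum number of nonzeros with 32-bit indices
print(nnz * (4 * 2 + 8) * 1e-9)    # -> ~34.4 (GB), matching the estimate above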

Because one doesn't know the size of the dataset in advance, I imagine it's not possible to automatically switch the dtype as is done in scipy,

  • Sparse matrices are no longer limited to 2^31 nonzero elements. They
    automatically switch to using 64-bit index data type for matrices containing
    more elements. [...]

but maybe it would be possible to add an indices_dtype=np.int32 parameter (analogous to the dtype parameter of the vectorizers) that would keep the current behavior by default and let people who need it switch to 64 bit? Though it would add yet another parameter to the vectorizers.. what do you think?
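
As a purely hypothetical illustration of that proposal (indices_dtype is not an existing vectorizer parameter), the CSR assembly step inside a vectorizer could honor such a parameter roughly like this:

import numpy as np
import scipy.sparse as sp

# Hypothetical sketch only; not an existing scikit-learn API.
def assemble_csr(values, j_indices, indptr, shape, indices_dtype=np.int32):
    indices = np.asarray(j_indices, dtype=indices_dtype)
    # indptr holds cumulative nonzero counts, so it is the array most likely
    # to exceed 2**31 - 1 and benefit from 64-bit entries
    indptr = np.asarray(indptr, dtype=indices_dtype)
    # note that scipy may still downcast to 32 bit when that is sufficient
    return sp.csr_matrix((values, indices, indptr), shape=shape)

# users who hit the overflow could then opt in explicitly:
X = assemble_csr([1, 2, 3], [0, 2, 1], [0, 2, 3], shape=(2, 3),
                 indices_dtype=np.int64)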

@jnothman
Member

I think we need @mannby to tell us if the recent changes to CountVectorizer fix their problem without changing the index dtype. I'm also inclined to say that anyone with a dataset that big should find a way to not vectorize it all at once (but I know we don't make it particularly easy to merge vocabularies from multiple vectorizer instances).
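
For what it's worth, merging vocabularies by hand is doable, if clunky; a rough sketch (the helper below is only an illustration, not something scikit-learn ships):

import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

def vectorize_in_chunks(doc_chunks):
    merged_vocab = {}
    parts = []
    for docs in doc_chunks:
        vec = CountVectorizer()
        X = vec.fit_transform(docs).tocoo()
        # remap this chunk's column indices into the shared vocabulary
        col_map = {idx: merged_vocab.setdefault(term, len(merged_vocab))
                   for term, idx in vec.vocabulary_.items()}
        cols = [col_map[c] for c in X.col]
        parts.append((X.data, X.row, cols, X.shape[0]))
    n_features = len(merged_vocab)
    mats = [sp.csr_matrix((data, (rows, cols)), shape=(n_rows, n_features))
            for data, rows, cols, n_rows in parts]
    return sp.vstack(mats, format="csr"), merged_vocab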

@psinger

psinger commented Jun 15, 2017

I think I have encountered this problem in various projects over the last few years. Working with text data that exceeds the limit here is something that is quite easily achieved. I would therefore vouch for a fix.

Personally, having a separate parameter for the dtype sounds fine.

@jnothman
Member

jnothman commented Jun 15, 2017 via email

@rth
Member

rth commented Jun 16, 2017

However, I have my doubts that this is often necessary after #7272,

I rebased and re-ran the tests on this, and this fix is still definitely necessary. The dataset size needed to run into this issue would be 2.1 billion words, which is roughly the size of the English Wikipedia; I agree it is fairly easy to reach.

Also, #8941 shows the same issue but for HashingVectorizer. My previous memory overhead estimate was too high (the size of the indptr array is typically negligible compared with the indices), and it also looks like csr_matrix will downcast the index arrays to 32 bit when that is sufficient for indexing:

In [7]: import numpy as np
   ...: from scipy.sparse import csr_matrix
   ...: data = np.array([1, 2, 3, 4, 5, 6])
   ...: indptr = np.array([0, 2, 3, 6], dtype='int64')
   ...: indices = np.array([0, 2, 2, 0, 1, 2], dtype='int64')
   ...: indices.dtype
   ...: 
Out[7]: dtype('int64')

In [8]: csr_matrix((data, indices, indptr)).indices.dtype
Out[8]: dtype('int32')

and it is of limited use while a number of our downstream estimators have Cython code relying on 32-bit sparse indices

Yes, even when this is fixed, one cannot, for example, "l2"-normalize the resulting array, because sklearn.utils.sparsefuncs_fast.inplace_csr_row_normalize_l2 doesn't support 64-bit indices.
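
Until that is fixed, a plain scipy/numpy row normalization that only touches .data works regardless of the index dtype; a minimal sketch:

import numpy as np
import scipy.sparse as sp

def l2_normalize_rows(X):
    # works for CSR matrices with either 32- or 64-bit indices/indptr,
    # since only the .data array is modified
    X = sp.csr_matrix(X).astype(np.float64)   # copy with float data
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0                   # leave all-zero rows untouched
    X.data /= np.repeat(norms, np.diff(X.indptr))  # divide each row by its norm
    return X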

#2969 needs fixing first; working on that..

but was that needed for the end matrix, or for the intermediate
representation which is now (very recently) much smaller

Both, I believe, even in the latest version.. (Will respond in more detail about this later)

@psinger

psinger commented Jun 16, 2017

@rth Thanks for confirming this. Funnily, I ran into this problem when I was working with the complete English Wikipedia ;)

@jnothman
Member

jnothman commented Jun 17, 2017 via email

@rth
Member

rth commented Jun 17, 2017

And why would you process the entirety of Wikipedia with a single
CountVectorizer (except for a lack of preexisting CountVectorizer merging
tools)?

Because one can easily do that with gensim, and one might try to run that example with scikit-learn's CountVectorizer or TfidfVectorizer and expect it to work.. I agree that chunking (optionally parallelizing) and merging the vocabularies / results at the end would be better and more efficient, but that's another level of complexity compared to just running the default vectorizers..

@psinger

psinger commented Jun 17, 2017

And why would you process the entirety of Wikipedia with a single
CountVectorizer (except for a lack of preexisting CountVectorizer merging
tools)?

If you have the memory for it, why not? It makes the whole process much easier instead of chunking etc.

@rth
Member

rth commented Jun 17, 2017

Rebased and continued this PR in #9147 ..
