[MRG] Support new scipy sparse array indices, which can now be > 2^31 (Was pull request #6194) #6473
Conversation
…63). This is needed for very large training sets. Feature indices (based on the number of distinct features) are unlikely to need 4 bytes per value, however.
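For context, here is a minimal sketch of the kind of dtype selection being discussed; it is an illustration under my own assumptions, not the PR's actual code. The row pointer (indptr) counts accumulated nonzeros and is the first array to overflow int32, while column (feature) indices stay far below 2**31.

```python
import numpy as np
import scipy.sparse as sp

def build_csr(data, j_indices, indptr, n_features):
    # indptr[-1] is the total number of stored values; once it exceeds the
    # int32 range, 64-bit index arrays are required (supported by newer scipy).
    needs_int64 = indptr[-1] > np.iinfo(np.int32).max
    index_dtype = np.int64 if needs_int64 else np.int32
    return sp.csr_matrix(
        (data,
         np.asarray(j_indices, dtype=index_dtype),
         np.asarray(indptr, dtype=index_dtype)),
        shape=(len(indptr) - 1, n_features),
    )
```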
Tweak comments
Would be great if this could get merged soon!
This works and should really get merged soon.
Thanks for the ping, @psinger. This needs a rebase.
Thanks for the PR @mannby! Wouldn't this make the index arrays take twice as much memory?
Do you think doubling the memory is a major risk, @rth? At least it's linear without an enormous constant... @ogrisel was +1 for that approach. However, I have my doubts that this is often necessary after #7272, and it is of limited use while a number of our downstream estimators have Cython code relying on 32-bit sparse indices, so I am consigning this issue to a later release.
I might be a bit biased on this, but to hit this error one needs to process [text] data that would produce sparse arrays larger than (2**31-1)*(4*2+8)*1e-9 = 34 GB. I imagine that for users who work with such large datasets this is definitely something that needs fixing, but for those who work with slightly smaller datasets this would be an additional 50% memory overhead without gaining anything (even if it scales linearly). For instance, the difference between 15 GB and 23.5 GB of RAM is significant, and it might break the processing pipeline of some users. Because one doesn't know the size of the dataset in advance, I imagine it's not possible to automatically switch dtype as is done in scipy, but maybe it could be possible to add the dtype as a separate parameter.
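As a rough back-of-envelope reading of those figures (my own interpretation of the 4*2+8 term: an int32 index, an int32 indptr slot in the worst case, and a float64 value per stored entry), switching the index arrays to int64 adds 8 bytes per nonzero, which is where the ~50% overhead comes from:

```python
nnz = 2**31 - 1                  # smallest nonzero count that overflows int32
bytes_per_nnz_int32 = 4 * 2 + 8  # int32 index + int32 indptr slot + float64 value
bytes_per_nnz_int64 = 8 * 2 + 8  # same, with 64-bit index arrays

print(nnz * bytes_per_nnz_int32 / 1e9)  # ~34.4 GB
print(nnz * bytes_per_nnz_int64 / 1e9)  # ~51.5 GB, i.e. ~50% more
```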
I think we need @mannby to tell us if the recent changes to the intermediate representation make this unnecessary.
I think I encountered this problem in various projects over the last few years. Working with text data that exceeds the limit here is something that is quite easily achieved. I would therefore vouch for a fix. Personally, having a separate parameter for the dtype sounds fine.
But was that needed for the end matrix, or for the intermediate representation, which is now (very recently) much smaller?
I rebased and reran the tests on this, and this fix is still definitely necessary. The dataset size needed to run into this issue would be 2.1 billion words, which is roughly the size of the English Wikipedia, so I agree it is fairly easy to reach. Also #8941 shows the same issue but for:

In [7]: import numpy as np
...: from scipy.sparse import csr_matrix
...: data = np.array([1, 2, 3, 4, 5, 6])
...: indptr = np.array([0, 2, 3, 6], dtype='int64')
...: indices = np.array([0, 2, 2, 0, 1, 2], dtype='int64')
...: indices.dtype
...:
Out[7]: dtype('int64')
In [8]: csr_matrix((data, indices, indptr)).indices.dtype
Out[8]: dtype('int32')
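In other words, scipy silently downcasts the index arrays to int32 whenever the values happen to fit, so code that wants to guarantee 64-bit indices has to upcast them explicitly. A minimal, hypothetical helper (not part of scikit-learn or this PR) could look like this:

```python
import numpy as np
import scipy.sparse as sp

def ensure_int64_indices(X):
    # Force the index arrays of a sparse matrix to int64,
    # converting to CSR format first if necessary.
    X = sp.csr_matrix(X)
    X.indices = X.indices.astype(np.int64, copy=False)
    X.indptr = X.indptr.astype(np.int64, copy=False)
    return X
```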
Yes, even when this is fixed, e.g. one cannot "l2"-normalize the result; #2969 needs fixing first: working on that.
Both, I believe, even in the latest version. (Will respond in more detail about this later.)
@rth Thanks for confirming this. Funnily enough, I ran into this problem when I was working with the complete English Wikipedia ;)
And why would you process the entirety of Wikipedia with a single CountVectorizer (except for a lack of preexisting CountVectorizer merging tools)?
Because one can easily do that with gensim, and one might try to run that example with scikit-learn's CountVectorizer.
If you have the memory for it, why not? It makes the whole process much easier than chunking, etc.
Rebased and continued this PR in #9147.
Enable training with very large numbers of samples and features.
I moved this pull request from my master branch to a side branch, to free up master in my repository.
Please see previous discussion in pull request #6194.