Vectorizing memory issue #6183

Closed
hidayata opened this issue Jan 18, 2016 · 5 comments · Fixed by #9147

Comments

@hidayata

Hi all

I'm working with a pretty large data set and am having an issue with line 758 of text.py (CountVectorizer code):

indptr.append(len(j_indices))

In my case, the length of j_indices is larger than the maximum signed int, and indptr is a signed-int array.
I tried making indptr a long array instead, but that leads to other, larger memory problems.
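
For what it's worth, the overflow can be reproduced in isolation. This is just a sketch of the mechanism, assuming indptr is a C signed-int array.array (typecode "i") on a build where C int is 32 bits:

```python
# Sketch of the failure in isolation: appending a value above 2**31 - 1 to a
# C signed-int array overflows (assuming a 32-bit C int).
from array import array

indptr = array("i", [0])
indptr.append(2**31 - 1)      # largest value that still fits
try:
    indptr.append(2**31)      # one past the limit
except OverflowError as exc:
    print(exc)                # "signed integer is greater than maximum"
```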

Any thoughts?

@lesteve
Member

lesteve commented Jan 19, 2016

Can you post the full traceback?

Also it would be great if you could put together a stand-alone example that reproduces the problem.

@hidayata
Author

This is the traceback. A minimal example that reproduces this would likely require tens of GB of data. From what I can tell, the issue is triggered when j_indices grows longer than 2^31 - 1 entries; its length appears to equal the number of nonzero entries in the feature matrix.

----> 1 all_vectorizer.fit(train_data)

/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.pyc in fit(self, raw_documents, y)
    787         self
    788         """
--> 789         self.fit_transform(raw_documents)
    790         return self
    791 

/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
    815 
    816         vocabulary, X = self._count_vocab(raw_documents,
--> 817                                           self.fixed_vocabulary_)
    818 
    819         if self.binary:

/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    756                     # Ignore out-of-vocabulary items for fixed_vocab=True
    757                     continue
--> 758             indptr.append(len(j_indices))
    759 
    760         if not fixed_vocab:

OverflowError: signed integer is greater than maximum
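
As a rough sanity check, one can estimate ahead of time whether the full corpus will push the nonzero count past 2^31 - 1. This is only a sketch: train_data is the corpus from the call above, and the subsample size is arbitrary.

```python
# Rough estimate of the final matrix's nnz by extrapolating from a subsample.
# `train_data` is the full corpus (a list of documents); 10000 is arbitrary.
from sklearn.feature_extraction.text import CountVectorizer

sample = train_data[:10000]
nnz_sample = CountVectorizer().fit_transform(sample).nnz
estimated_nnz = nnz_sample * len(train_data) / len(sample)
print(estimated_nnz > 2**31 - 1)   # True: expect this overflow on the full fit
```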

@v1shwa

v1shwa commented Nov 6, 2017

Hi, I'm still getting this error in version 0.19.1. Any workaround?

@jnothman
Member

jnothman commented Nov 6, 2017

You can try patching your copy of the library with #9147.
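
If patching isn't an option, one possible stop-gap (a sketch, untested at this scale; it assumes a fixed vocabulary built from a subsample is acceptable) is to transform the corpus in chunks small enough that each chunk stays well below 2^31 - 1 nonzeros, then stack the pieces:

```python
# Stop-gap sketch: build the vocabulary from a subsample, transform the corpus
# in chunks so each chunk's index arrays stay well below 2**31 - 1 entries,
# then stack the pieces into one sparse matrix.
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

vocab = CountVectorizer().fit(train_data[:1000000]).vocabulary_  # subsample size is arbitrary
vec = CountVectorizer(vocabulary=vocab)

chunk = 1000000
parts = [vec.transform(train_data[i:i + chunk])
         for i in range(0, len(train_data), chunk)]
X = sp.vstack(parts)   # scipy should pick a wide enough index dtype for the stacked result
```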

@v1shwa

v1shwa commented Nov 6, 2017

Wow! That was quick. Thanks @jnothman, I will give it a try.
