[MRG] fixes an issue w/ large sparse matrix indices in CountVectorizer #11295
Conversation
Nice work on investigating! Overall I think it's the right fix.
If a CountVectorizer's feature matrix ends up having more than or equal to 2^31 values, it needs to use 64 bit indices / index pointers

The case of 64 bit indptr was fixed in #9147, in which case the indices also end up 64 bit as a side effect, due to constraints of scipy CSR matrices, if I remember correctly (see the related discussion in #9147 (comment)). Needing 64 bit indices by itself is not something that happens much: that would mean a vocabulary of 2e9 unique tokens (roughly the total word count of English Wikipedia, but in unique tokens).
However, when we do use 64 bit indices/indptr, I think this change means one less memory copy due to dtype conversion and a more consistent output array dtype, so it's all good.
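A minimal sketch of the threshold in question (plain numpy; in a CSR matrix, the last entry of indptr equals the number of stored values, nnz):

import numpy as np

# int32 tops out at 2**31 - 1, so a matrix with nnz >= 2**31 stored
# values needs a 64 bit indptr (and, to avoid mixed dtypes, 64 bit
# indices as well).
nnz = 2**31
assert nnz > np.iinfo(np.int32).max  # np.iinfo(np.int32).max == 2**31 - 1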
See other comments below,
vec._sort_features(X, vocabulary)

assert_equal(INDEX_DTYPE, X.indices.dtype)
assert INDEX_DTYPE == X.indices.dtype
will produce nicer error messages with pytest
good to know, will update
"great!": 2 | ||
} | ||
|
||
vec._sort_features(X, vocabulary) |
According to the docstring, this returns the reordered matrix, so let's validate the return value instead:
Xs = vec._sort_features(X, vocabulary)
Just to be safe. Looking at the code, I agree that it does inplace modification of X, but there is no guarantee that it will continue to do that in the future.
Sure, that makes sense, I'll update.
# force indices and indptr to int64.
INDEX_DTYPE = np.int64
X.indices = X.indices.astype(INDEX_DTYPE, copy=False)
X.indptr = X.indptr.astype(INDEX_DTYPE, copy=False)
Are you sure copy=False has an impact here?
I am not sure -- I suppose it doesn't matter if it's a copy or not.
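For reference, a minimal sketch of numpy's astype copy semantics (plain numpy, independent of this PR):

import numpy as np

a = np.arange(5, dtype=np.int64)

# When the dtype already matches, copy=False lets numpy return the
# input array itself instead of allocating a new one.
b = a.astype(np.int64, copy=False)
assert b is a

# When the dtype differs, a converted copy is returned regardless.
c = a.astype(np.int32, copy=False)
assert c is not a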
# If a count vectorizer has to store >= 2**31 count values, the sparse
# storage matrix has 64bit indices / indptrs. This requires ~2*8*2**31
# bytes of memory in practice, so we just test the method that would
# hypothetically fail.
Maybe just say,
"""
Check that _sort_features
preserves the indices dtype of sparse arrays with 64 bit indices
"""
Yep, I'll update it to something cleaner.
X = sparse.csr_matrix((5, 5), dtype=np.int64)

# force indices and indptr to int64.
INDEX_DTYPE = np.int64
INDICES_DTYPE
if we want to be consistent
Sure, will update.
@@ -1097,3 +1097,28 @@ def test_vectorizers_invalid_ngram_range(vec):
    if isinstance(vec, HashingVectorizer):
        assert_raise_message(
            ValueError, message, vec.transform, ["good news everyone"])


@pytest.mark.parametrize("vec", [CountVectorizer()])
I think there is no need to parametrize it, since we are only checking the CountVectorizer
(which should be sufficient).
Yeah, I wasn't sure at first if we needed to test this with other vectorizers, but I don't think so.
Force-pushed from 9c0b813 to 6d6099d.
Yes this will likely never happen in practice!
CIs unhappy.
Looking into it!
It would be great to have this. @gvacaliuc any chance you could fix the CI (and resolve the conflict)? Thanks!
Hey there, sorry to ghost. I'll mess around with it tonight and see if I can figure it out.
Force-pushed from 6d6099d to 59d12b8.
* added test to show example of issue in scikit-learn#7762
* fixes error caused by manually manipulating sparse indices in `CountVectorizer`
Force-pushed from 59d12b8 to 317a169.
Sorry, life's been busy lately. Appveyor fails on 32-bit Python 2.7.8. IIUC this issue would never arise on a 32 bit arch, but I'll set up a test environment to figure out what's going on. The strange thing is that this is the relevant bit of code:

map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)
for new_val, (term, old_val) in enumerate(sorted_features):
    vocabulary[term] = new_val
    map_index[old_val] = new_val
X.indices = map_index.take(X.indices, mode='clip')

And appveyor is failing on that last line due to a type error (cannot cast int64 to int32). Clearly map_index and X.indices are the same dtype, so my guess off the bat is that there's some issue with trying to index the map_index array with 64 bit integers on a 32 bit architecture, and that under the hood numpy tries to cast whatever dtype the index array is to int32.
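To make the suspected failure concrete, a hypothetical minimal reproduction (the array contents here are made up; only the dtypes matter):

import numpy as np

# Suspicion: np.take converts its index array to np.intp, the platform
# pointer-sized integer. On a 32 bit build intp is int32, and int64
# indices cannot be cast to it safely, hence the reported type error.
map_index = np.arange(10, dtype=np.int64)
indices = np.array([0, 3, 7], dtype=np.int64)
remapped = map_index.take(indices, mode='clip')  # would fail on 32 bit Python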
I will admit I also find this quite inexplicable...
I can reproduce the 64 bit indexing issue you see. What we could do is to just use the previous behavior (and skip the added tests) on 32 bit systems. Here is the code that could be used for detection (and skipping tests):
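The detection snippet itself did not survive extraction; a sketch of what it could look like, using pointer size as the proxy for the architecture (the _IS_32BIT name matches the one used in the diff further below):

import struct

# struct.calcsize("P") is the size of a C pointer in bytes, so this
# flag is True on 32 bit interpreters and False on 64 bit ones.
_IS_32BIT = 8 * struct.calcsize("P") == 32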
@@ -852,7 +852,7 @@ def _sort_features(self, X, vocabulary):
        Returns a reordered matrix and modifies the vocabulary in place
        """
        sorted_features = sorted(six.iteritems(vocabulary))
-       map_index = np.empty(len(sorted_features), dtype=np.int32)
+       map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)
Indices should be of dtype np.intp
Should we just add a check here that catches an improper conversion to int64 on 32 bit arch? Or should we never be mucking with the index dtype at all, and just throw when our np.intp would overflow?
scikit-learn/sklearn/feature_extraction/text.py
Lines 944 to 955 in 317a169
if indptr[-1] > 2147483648:  # = 2**31 - 1
    if sp_version >= (0, 14):
        indices_dtype = np.int64
    else:
        raise ValueError(('sparse CSR array has {} non-zero '
                          'elements and requires 64 bit indexing, '
                          ' which is unsupported with scipy {}. '
                          'Please upgrade to scipy >=0.14')
                         .format(indptr[-1], '.'.join(sp_version)))
else:
    indices_dtype = np.int32
I don't have enough background here - could you link to the PR that added 64-bit indexing to scipy?
At any rate, it seems to me that indices_dtype should always be np.intp - if that's not possible, then you should pick between np.int32 (on old scipy) and np.intp (on fixed scipy).
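For context, a quick illustration of how np.intp tracks the platform (plain numpy, nothing PR-specific):

import numpy as np

# np.intp is the pointer-sized integer: int32 on 32 bit builds and
# int64 on 64 bit builds, so its max bounds the largest usable index.
print(np.dtype(np.intp), np.iinfo(np.intp).max)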
The indices can be either int32 or int64; scipy.sparse doesn't care which (intp size does not matter). Both indices and indptr, however, need to be of the same dtype to avoid casts on each operation.
Of course, when you construct things manually, you'll also need to choose the dtype so that it can hold the items you are going to insert.
@@ -852,7 +852,7 @@ def _sort_features(self, X, vocabulary):
        Returns a reordered matrix and modifies the vocabulary in place
Unrelated, but: this seems to reorder X in place too.
Indeed it does -- I can update that when we decide what the desired behavior is.
That's intentional, I think, to reduce memory usage; why would it be an issue?
It's an issue only because the documentation doesn't tell me that's going to happen.
This seems to reorder X in place too.
Agreed, we should add a note about it but in a separate PR.
Thanks a lot for your feedback on this, @eric-wieser and @pv!
That was my basic principle for #9147. I'll admit I have not thought too much about 32 bit architectures, beyond making the tests pass (which also run on 32 bit appveyor). It's possible that there is a more elegant solution.

In this scikit-learn context, few users would need to vectorize enough text data to produce 64 bit indexed sparse arrays: just the resulting sparse array would take >30-50 GB of memory, which comes close to the inherent 64 GB RAM limit of 32 bit architectures (with PAE). So currently we support this use case with recent scipy, on Unix or Windows with Python 3, and (possibly?) 64 bit architectures only, and fail in all other cases. I don't think there is a need to support 32 bit architectures here. On the other hand, we do need to support all OSes, Python versions and 32 bit/64 bit architectures for the smaller text datasets that produce 32 bit indexed sparse arrays.

So in the end, I am not sure what can/should be done here beyond what I proposed in #11295 (comment), but you might have better ideas.
Sorry, accidental key combination... Updated the unfinished comment (#11295 (comment)) above.
Merged master in to resolve conflicts. @gvacaliuc please run a
Thanks!
Xs = CountVectorizer()._sort_features(X, vocabulary)

assert INDICES_DTYPE == Xs.indices.dtype
I think PEP8 is complaining that a blank line contains whitespace here?
This should be a candidate for 0.20.3 (have we got a way to mark that?)
Doesn't creating a milestone and putting it in there make sense? We can then backport them all, I guess.
Thanks for your work on this @gvacaliuc
I think it's acceptable not to support 64 bit sparse indices on 32 bit architectures (particularly if the alternative is some performance cost on all platforms). The main use cases for 32 bit arch I can think of (Raspberry Pi & embedded, Windows 32 bit, WebAssembly) are not typically ones that need to process 10s of GB of sparse data.
Thanks @gvacaliuc! Looks good apart from a minor comment below.
sklearn/feature_extraction/text.py (outdated)

@@ -961,12 +962,17 @@ def _count_vocab(self, raw_documents, fixed_vocab):
                            " contain stop words")

        if indptr[-1] > 2147483648:  # = 2**31 - 1
-           if sp_version >= (0, 14):
+           if sp_version >= (0, 14) and not _IS_32BIT:
Our current minimal supported scipy version is 0.17, so you can just remove the scipy version check and remove the last else clause.
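A sketch of what the simplified branch could look like once the scipy version check is dropped (assuming the _IS_32BIT helper from the diff above; this is an illustration, not the final merged code):

if indptr[-1] > np.iinfo(np.int32).max:  # values no longer fit in int32
    if _IS_32BIT:
        raise ValueError('sparse CSR array has {} non-zero elements '
                         'and requires 64 bit indexing, which is '
                         'unsupported on 32 bit Python.'
                         .format(indptr[-1]))
    indices_dtype = np.int64
else:
    indices_dtype = np.int32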
cool, will update now!
Let me know if there are any other changes I should make; it looks like the failing check is a code coverage issue. Since I didn't add much code, I think the only test I could add is to check that 32 bit Python throws the error, but that might require a very large vocab 😮
Yes, presumably because the error isn't tested in our CI.
I'm going to take @rth's comment as an approval and merge this.
Thanks @gvacaliuc!
Wait... Lacks a changelog entry. @gvacaliuc Please add an entry to the change log at
LGTM (apart from the missing what's new entry).
OK, just added the change log entry. Let me know if I missed anything else 😄
Thanks @gvacaliuc
scikit-learn#11295)" This reverts commit 1da72e7.
scikit-learn#11295)" This reverts commit 1da72e7.
Reference Issues/PRs

#7762

What does this implement/fix? Explain your changes.

If a CountVectorizer's feature matrix ends up having more than or equal to 2^31 values, it needs to use 64 bit indices / index pointers. If the features get sorted via a call to _sort_features, the new indices were hard coded to be 32 bit. I've updated that call to make the new indices the same dtype as before.

Any other comments?

The issue was more broad than this one case, but I didn't identify any other manual modifications of sparse matrix indices / indptrs in the code. See #7762 for more details on that.