-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Support new scipy sparse array indices, which can now be > 2^31 #6194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…63). This is needed for very large training sets. Feature indices (based on the number of distinct features), are unlikely to need 4 bytes per value, however.
Is this related to #6183? From a quick look at it, it seems you are hitting the same limitation. |
Yes, exactly. The length of the array of feature indexes gets too big for that length to be stored in the second array. |
Note that making both arrays store longs causes huge memory bloat. |
@mannby do you have a stand-alone example to reproduce the problem? If so, it would be great to turn it into a test. |
Sorry, no stand-alone example. Although I have tested that it works with the fix and doesn't work without it on a large real dataset on a 64-bit system. |
I think this code should only use 64bit integer if we know for sure that the version of scipy will support it. For instance with something like: from sklearn.utils.fixes import sp_version
if sp_version >= (0, 14):
# we can use 64 bit indices. |
Please add an inline comment to make that explicit. @larsmans do this seem right to you? |
Yes, that seems right. However, this doesn't change anything on Windows, where Regarding the mixing of |
Indeed, +1 for always using 64 bit indices both for |
@@ -909,6 +911,10 @@ def _make_int_array(): | |||
"""Construct an array.array of a type suitable for scipy.sparse indices.""" | |||
return array.array(str("i")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str()
is not required here. array.array("i")
should work both under Python 2 and Python 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get TypeError: must be char, not unicode if the str() is not there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get TypeError: must be char, not unicode if the str() is not there..
Interesting, which platform and version of python ? It works fine for me for all the python versions I tried (2.6, 2.7, 3.3, 3.4, 3.5). Python installed through conda on a Ubuntu box, if that matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, the str() was there from before, so presumably someone else had the need for the conversion at some earlier point too.
The Python where this failed for me is a UCS-2 build of Python 2.7.10 on a 64-bit OS X system:
Python 2.7.10 (default, Oct 23 2015, 18:05:06)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Tweak comments
I've followed the recommendations in the latest commit to the pull request. |
I made the mistake of using my master branch for this pull request. I'd like to make a new branch, and move this pull request to the branch, although I assume I'll have to create a new pull request from the new branch. So, I think I'll need to close this pull request, and re-open one from the new branch, to avoid further commits showing up here. |
(< 2^63)
This is needed for very large training sets.
Feature indices (based on the number of distinct features), are unlikely to need 4 bytes per value, however.