Support new scipy sparse array indices, which can now be > 2^31 #6194


Closed
wants to merge 3 commits into from

Conversation


@mannby mannby commented Jan 19, 2016

(< 2^63)

This is needed for very large training sets.
Feature indices (bounded by the number of distinct features) are unlikely to need more than 4 bytes per value, however.

@lesteve
Member

lesteve commented Jan 20, 2016

Is this related to #6183? From a quick look at it, it seems you are hitting the same limitation.

@mannby
Author

mannby commented Jan 20, 2016

Yes, exactly. The array of feature indices gets so long that its length can no longer be stored in the second array.
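The two arrays in question are the standard CSR pair; a small sketch (example matrix mine, not from the PR) of why the second array overflows first:

```python
import numpy as np
from scipy import sparse

# A CSR matrix keeps three arrays: `data`, `indices` (the column index
# of each non-zero value) and `indptr` (cumulative non-zero counts, one
# entry per row plus one). `indices` values are bounded by n_features,
# but `indptr` values are bounded by the total number of non-zeros, so
# on a huge training set indptr overflows int32 first, even while every
# column index still fits comfortably in 32 bits.
m = sparse.csr_matrix(np.eye(3))
print(m.indptr[-1] == m.nnz)    # the last indptr entry counts all non-zeros
print(np.iinfo(np.int32).max)   # 2147483647, the int32 ceiling
```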

@mannby
Author

mannby commented Jan 20, 2016

Note that making both arrays store longs causes huge memory bloat.
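Back-of-the-envelope arithmetic (figures mine, purely illustrative) shows why widening only indptr is the cheap option:

```python
import numpy as np

# Hypothetical matrix: 3 billion non-zeros, 10 million rows.
nnz = 3 * 10**9
n_rows = 10**7

# `indices` has one entry per non-zero, so widening it from int32 to
# int64 adds 4 bytes per non-zero: about 12 GB here.
indices_extra = nnz * (np.dtype(np.int64).itemsize - np.dtype(np.int32).itemsize)

# `indptr` has only n_rows + 1 entries, so storing it as int64 costs
# about 80 MB in total, negligible by comparison.
indptr_total = (n_rows + 1) * np.dtype(np.int64).itemsize

print(indices_extra)  # 12000000000
print(indptr_total)   # 80000008
```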

@lesteve
Member

lesteve commented Jan 21, 2016

@mannby do you have a stand-alone example to reproduce the problem? If so, it would be great to turn it into a test.

@mannby
Author

mannby commented Jan 22, 2016

Sorry, no stand-alone example, although I have verified that it works with the fix and fails without it on a large real dataset on a 64-bit system.
It does rely on "l" and "int_" being the same size. By my reading of the Python docs, though, both should map to C "long", which makes them consistent on any given platform.
The second change offers a way of specifying "int64", but I don't know of a corresponding way to force a 64-bit integer in the first change, so it seems best to rely on "long" in both cases.
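The size assumption is easy to check directly (a sketch; the sizes quoted in the comment held for CPython/NumPy of that era):

```python
import array
import numpy as np

# Both the "l" typecode and numpy.int_ were documented, at the time of
# this discussion, to map to the platform's C long, so their sizes
# agree on any one platform: 8 bytes on 64-bit Linux and macOS, but
# only 4 bytes on Windows, where C long stays 32-bit even on 64-bit
# builds (the caveat raised further down in this thread).
print(array.array(str("l")).itemsize)
print(np.dtype(np.int_).itemsize)
```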

@ogrisel
Member

ogrisel commented Jan 28, 2016

I think this code should only use 64-bit integers if we know for sure that the installed version of scipy supports them. For instance with something like:

from sklearn.utils.fixes import sp_version

if sp_version >= (0, 14):
    # we can use 64-bit indices.
    ...

@ogrisel
Member

ogrisel commented Jan 28, 2016

It does rely on "l" and "int_" being the same size, though. But, by my reading of the Python docs, both should map to C "long"s, which would be consistent on a given platform.

Please add an inline comment to make that explicit. @larsmans does this seem right to you?

@larsmans
Member

Yes, that seems right.

However, this doesn't change anything on Windows, where long is 32 bits (while scipy.sparse is careful to use 64-bit ints on all platforms). There's no way to get 64-bit integers into an array.array on Windows in Python 2; Python 3 adds long long support (the "q" typecode), which is what you'd need.

Regarding the mixing of intc and long, SciPy's CSR matrices always use the same type for indices and indptr, so using intc for one of them forces a conversion later.
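That forced conversion is easy to demonstrate (a sketch, not code from the PR):

```python
import numpy as np
from scipy import sparse

# If indices and indptr arrive with different integer dtypes, scipy
# unifies them while building the CSR matrix, copying and casting the
# mismatched array, so mixing intc and long saves no memory in the end.
data = np.ones(3)
indices = np.array([0, 1, 2], dtype=np.intc)
indptr = np.array([0, 1, 2, 3], dtype=np.int64)
m = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
print(m.indices.dtype == m.indptr.dtype)  # True: scipy made them match
```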

@ogrisel
Member

ogrisel commented Jan 28, 2016

Regarding the mixing of intc and long, SciPy's CSR matrices always use the same type for indices and indptr, so using intc for one of them forces a conversion later.

Indeed, +1 for always using 64-bit indices for both the indices and indptr arrays when the scipy version is 0.14 or later.
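Putting the two suggestions together, a version-guarded construction might look like this sketch (sp_version is recomputed from scipy.__version__ here rather than imported from sklearn.utils.fixes, and the small example matrix is mine):

```python
import numpy as np
import scipy
from scipy import sparse

# Use 64-bit index arrays only when the installed scipy is known to
# support them (0.14 and later, per the discussion above).
sp_version = tuple(int(x) for x in scipy.__version__.split(".")[:2])
index_dtype = np.int64 if sp_version >= (0, 14) else np.int32

data = np.ones(3)
indices = np.array([0, 1, 2], dtype=index_dtype)
indptr = np.array([0, 1, 2, 3], dtype=index_dtype)

# scipy accepts the wide index arrays; note it may still downcast them
# when the matrix is small enough for 32-bit indexing.
m = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
print(m.nnz)  # 3
```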

@@ -909,6 +911,10 @@ def _make_int_array():
"""Construct an array.array of a type suitable for scipy.sparse indices."""
return array.array(str("i"))
Member


str() is not required here. array.array("i") should work both under Python 2 and Python 3.

Author


I get TypeError: must be char, not unicode if the str() is not there.

Member


I get TypeError: must be char, not unicode if the str() is not there.

Interesting, which platform and version of Python? It works fine for me on all the Python versions I tried (2.6, 2.7, 3.3, 3.4, 3.5); Python installed through conda on an Ubuntu box, if that matters.
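For reference, a minimal sketch of the incompatibility being discussed (the failure only occurs on Python 2 when the typecode is a unicode string, e.g. under from __future__ import unicode_literals; the str() wrapper is a no-op on Python 3):

```python
import array

# On Python 2, array.array requires a native (byte) string typecode and
# raises "TypeError: must be char, not unicode" when given a unicode
# "i". Wrapping the typecode in str() yields a native str on both
# Python 2 and Python 3, so the call works everywhere.
counts = array.array(str("i"))
counts.extend([1, 2, 3])
print(counts.typecode, counts.itemsize, list(counts))
```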

Author


BTW, the str() was there from before, so presumably someone else had the need for the conversion at some earlier point too.

The Python where this failed for me is a UCS-2 build of Python 2.7.10 on a 64-bit OS X system:

Python 2.7.10 (default, Oct 23 2015, 18:05:06)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin

@mannby
Author

mannby commented Feb 9, 2016

I've followed the recommendations in the latest commit to the pull request.

@mannby
Author

mannby commented Mar 1, 2016

I made the mistake of using my master branch for this pull request.

I'd like to make a new branch and move this work to it, although I assume I'll have to create a new pull request from the new branch.

So I think I'll need to close this pull request and open a new one from that branch, to avoid further commits showing up here.

4 participants