Support new scipy sparse array indices, which can now be > 2^31 #6194

mannby · 2016-01-19T21:10:28Z

(< 2^63)

This is needed for very large training sets.
Feature indices (based on the number of distinct features), are unlikely to need 4 bytes per value, however.

…63). This is needed for very large training sets. Feature indices (based on the number of distinct features), are unlikely to need 4 bytes per value, however.

lesteve · 2016-01-20T12:43:52Z

Is this related to #6183? From a quick look at it, it seems you are hitting the same limitation.

mannby · 2016-01-20T16:22:52Z

Yes, exactly. The length of the array of feature indexes gets too big for that length to be stored in the second array.

mannby · 2016-01-20T16:24:02Z

Note that making both arrays store longs causes huge memory bloat.

lesteve · 2016-01-21T07:50:07Z

@mannby do you have a stand-alone example to reproduce the problem? If so, it would be great to turn it into a test.

mannby · 2016-01-22T18:18:19Z

Sorry, no stand-alone example. Although I have tested that it works with the fix and doesn't work without it on a large real dataset on a 64-bit system.
It does rely on "l" and "int_" being the same size, though. But, by my reading of the Python docs, both should map to C "long"s, which would be consistent on a given platform.
The second change seems to have the option of specifying "int64", but I don't know of a corresponding way to force the integer to be 64 bits in the first change. So, it should be best to rely on "long" in both cases.

ogrisel · 2016-01-28T10:21:06Z

I think this code should only use 64bit integer if we know for sure that the version of scipy will support it. For instance with something like:

from sklearn.utils.fixes import sp_version

if sp_version >= (0, 14):
    # we can use 64 bit indices.

ogrisel · 2016-01-28T10:36:32Z

It does rely on "l" and "int_" being the same size, though. But, by my reading of the Python docs, both should map to C "long"s, which would be consistent on a given platform.

Please add an inline comment to make that explicit. @larsmans do this seem right to you?

larsmans · 2016-01-28T10:44:33Z

Yes, that seems right.

However, this doesn't change anything on Windows, where long is 32 bits (while scipy.sparse is careful to use 64-bit ints on all platforms). There's no way to get 64-bit integers into an array.array on Windows in Python 2. Python 3 adds long long support, which is what you'd need.

Regarding the mixing of intc and long, SciPy's CSR matrices always use the same type for indices and indptr, so using intc for one of them forces a conversion later.

ogrisel · 2016-01-28T11:56:23Z

Regarding the mixing of intc and long, SciPy's CSR matrices always use the same type for indices and indptr, so using intc for one of forces a conversion later.

Indeed, +1 for always using 64 bit indices both for indices and indptr arrays when scipy version is 0.14 or later.

ogrisel · 2016-01-28T13:06:01Z

sklearn/feature_extraction/text.py

@@ -909,6 +911,10 @@ def _make_int_array():
    """Construct an array.array of a type suitable for scipy.sparse indices."""
    return array.array(str("i"))


str() is not required here. array.array("i") should work both under Python 2 and Python 3.

I get TypeError: must be char, not unicode if the str() is not there.

I get TypeError: must be char, not unicode if the str() is not there..

Interesting, which platform and version of python ? It works fine for me for all the python versions I tried (2.6, 2.7, 3.3, 3.4, 3.5). Python installed through conda on a Ubuntu box, if that matters.

BTW, the str() was there from before, so presumably someone else had the need for the conversion at some earlier point too.

The Python where this failed for me is a UCS-2 build of Python 2.7.10 on a 64-bit OS X system:

Python 2.7.10 (default, Oct 23 2015, 18:05:06)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin

Tweak comments

mannby · 2016-02-09T00:02:28Z

I've followed the recommendations in the latest commit to the pull request.

mannby · 2016-03-01T21:19:12Z

I made the mistake of using my master branch for this pull request.

I'd like to make a new branch, and move this pull request to the branch, although I assume I'll have to create a new pull request from the new branch.

So, I think I'll need to close this pull request, and re-open one from the new branch, to avoid further commits showing up here.

Support new scipy sparse array indices, which can now be > 2^31 (< 2^…

ddcb64d

…63). This is needed for very large training sets. Feature indices (based on the number of distinct features), are unlikely to need 4 bytes per value, however.

Also increase size of integer values in indptr in the next step.

c75c0b8

ogrisel reviewed Jan 28, 2016
View reviewed changes

Use long for both arrays if scipy >= 0.14.

3ec2503

Tweak comments

mannby closed this Mar 1, 2016

mannby mentioned this pull request Mar 1, 2016

[MRG] Support new scipy sparse array indices, which can now be > 2^31 (Was pull request #6194) #6473

Closed

rth mentioned this pull request Jun 17, 2017

[MRG+1] Support for 64 bit sparse array indices in text vectorizers #9147

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support new scipy sparse array indices, which can now be > 2^31 #6194

Support new scipy sparse array indices, which can now be > 2^31 #6194

mannby commented Jan 19, 2016

lesteve commented Jan 20, 2016

mannby commented Jan 20, 2016

mannby commented Jan 20, 2016

lesteve commented Jan 21, 2016

mannby commented Jan 22, 2016

ogrisel commented Jan 28, 2016

ogrisel commented Jan 28, 2016

larsmans commented Jan 28, 2016

ogrisel commented Jan 28, 2016

ogrisel Jan 28, 2016

mannby Jan 28, 2016

lesteve Jan 29, 2016

mannby Jan 29, 2016

mannby commented Feb 9, 2016

mannby commented Mar 1, 2016

		@@ -909,6 +911,10 @@ def _make_int_array():
		"""Construct an array.array of a type suitable for scipy.sparse indices."""
		return array.array(str("i"))

Support new scipy sparse array indices, which can now be > 2^31 #6194

Support new scipy sparse array indices, which can now be > 2^31 #6194

Conversation

mannby commented Jan 19, 2016

lesteve commented Jan 20, 2016

mannby commented Jan 20, 2016

mannby commented Jan 20, 2016

lesteve commented Jan 21, 2016

mannby commented Jan 22, 2016

ogrisel commented Jan 28, 2016

ogrisel commented Jan 28, 2016

larsmans commented Jan 28, 2016

ogrisel commented Jan 28, 2016

ogrisel Jan 28, 2016

Choose a reason for hiding this comment

mannby Jan 28, 2016

Choose a reason for hiding this comment

lesteve Jan 29, 2016

Choose a reason for hiding this comment

mannby Jan 29, 2016

Choose a reason for hiding this comment

mannby commented Feb 9, 2016

mannby commented Mar 1, 2016