ValueError: Only sparse matrices with 32-bit integer indices are accepted. #31073
Comments
Could you please provide an actual reproducer that we can just copy and paste? I tried the following and cannot reproduce, because the resulting matrix keeps 32-bit indices:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import BisectingKMeans
import numpy as np

vectorizer = HashingVectorizer(ngram_range=(1, 5), alternate_sign=False, norm='l1')
X = np.asarray([
    'The quick brown fox jumps over the lazy dog',
    'The lazy brown fox jumps over the brown dog',
    'The lazy brown dog jumps over the quick fox',
], dtype=object)
X_vectorized = vectorizer.fit_transform(X)
bkm = BisectingKMeans(n_clusters=2, random_state=42, bisecting_strategy="largest_cluster")
bkm.fit(X_vectorized)
```

What is the shape of your vectorized data?

```python
>>> X_vectorized.shape
(3, 1048576)
>>> X_vectorized.indices.dtype
dtype('int32')
```
If the shape is small enough, maybe you could change …
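The suggestion above is cut off in this copy; if it was about downcasting the index arrays, a minimal sketch (assuming `X_vectorized` is a scipy CSR matrix whose index values all fit in the int32 range) might look like this:

```python
import numpy as np

# Hypothetical workaround: downcast the CSR index arrays back to int32.
# Only safe when both the number of stored elements (nnz) and the number
# of columns are below np.iinfo(np.int32).max.
int32_max = np.iinfo(np.int32).max
if X_vectorized.nnz <= int32_max and X_vectorized.shape[1] <= int32_max:
    X_vectorized.indices = X_vectorized.indices.astype(np.int32)
    X_vectorized.indptr = X_vectorized.indptr.astype(np.int32)
```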
Note that I can reproduce the exception as expected if I force the use of 64-bit index arrays:

```python
X_vectorized.indices = X_vectorized.indices.astype(np.int64)
X_vectorized.indptr = X_vectorized.indptr.astype(np.int64)
bkm.fit(X_vectorized)
```
but my data is really large indeed, so it can't be represented with int32 indices. Basically I'm wondering whether there is some way to cluster a really huge dataset on its n-gram features.
What is the exact shape? Number of rows and dimensions of X_vectorized?
You can probably progressively cluster a dataset with a large number of rows using either …
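The estimator names are cut off in this copy; as one hedged possibility, MiniBatchKMeans supports incremental fitting through partial_fit on sparse batches (note this is a different algorithm than the BisectingKMeans used above, and iter_document_batches is a hypothetical helper yielding lists of raw documents):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer

# Sketch: stream the corpus in mini-batches so the full n-gram matrix is
# never materialized at once; each per-batch CSR matrix keeps int32 indices.
vectorizer = HashingVectorizer(ngram_range=(1, 5), alternate_sign=False, norm='l1')
mbk = MiniBatchKMeans(n_clusters=2, random_state=42)

for docs in iter_document_batches(batch_size=10_000):  # hypothetical helper
    mbk.partial_fit(vectorizer.transform(docs))

# Assign labels batch by batch as well.
labels = [mbk.predict(vectorizer.transform(docs))
          for docs in iter_document_batches(batch_size=10_000)]
```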
You can also preprocess the data incrementally to reduce its dimensionality using …
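Here too the transformer name is cut off; one hedged option is IncrementalPCA via partial_fit. Each batch has to be densified, so the sketch below deliberately shrinks the hash space; all sizes are illustrative assumptions, and iter_document_batches is the same hypothetical helper as above:

```python
from sklearn.decomposition import IncrementalPCA
from sklearn.feature_extraction.text import HashingVectorizer

# Sketch: incremental dimensionality reduction before clustering.
# n_features=2**16 keeps each densified batch manageable; each partial_fit
# batch must contain at least n_components samples.
vectorizer = HashingVectorizer(ngram_range=(1, 5), alternate_sign=False,
                               norm='l1', n_features=2**16)
ipca = IncrementalPCA(n_components=100)

for docs in iter_document_batches(batch_size=1_000):
    ipca.partial_fit(vectorizer.transform(docs).toarray())

# The reduced batches can then be fed to a clustering estimator.
X_reduced = [ipca.transform(vectorizer.transform(docs).toarray())
             for docs in iter_document_batches(batch_size=1_000)]
```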
Describe the workflow you want to enable
The use case that triggers the issue is very simple. I am trying to compute the n-gram features of a tokenized dataset of about 1M documents (i.e., going from List[str] to List[int]) and then perform clustering on the dataset based on these features.
However, as the n-gram size or the dataset size grows, it is easy to run into the error shown in the title.
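For background, my understanding (an assumption, not something stated in this issue) is that scipy switches the CSR index arrays to int64 once the stored values no longer fit in the int32 range, and BisectingKMeans then rejects the matrix. A small diagnostic sketch:

```python
import numpy as np

# Report whether a CSR matrix has been upcast to 64-bit indices, which is
# what triggers the ValueError quoted in the title.
def has_64bit_indices(X_sparse):
    return (X_sparse.indices.dtype == np.int64
            or X_sparse.indptr.dtype == np.int64)

# Example, using the X_vectorized from the reproducer earlier in the thread:
# print(has_64bit_indices(X_vectorized), X_vectorized.nnz, np.iinfo(np.int32).max)
```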