ValueError: Only sparse matrices with 32-bit integer indices are accepted. #31073


Open

rangehow opened this issue Mar 26, 2025 · 7 comments

Labels: Needs Reproducible Code, New Feature

Comments

rangehow commented Mar 26, 2025

Describe the workflow you want to enable

The use case that triggers the issue is very simple. I am trying to compute n-gram features for a tokenized dataset of about 1M documents (i.e., each List[str] converted to List[int]) and then cluster the dataset based on these features.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import BisectingKMeans

vectorizer = HashingVectorizer(ngram_range=(1, 5), alternate_sign=False, norm='l1')

# parallel_transform is a user-defined helper that splits the dataset into
# chunks, transforms each chunk with the vectorizer in a separate process,
# and stacks the resulting sparse matrices.
X_tfidf = parallel_transform(train_dataset, vectorizer, num_chunks=64)

cluster_func = BisectingKMeans(n_clusters=num_clusters, random_state=42, bisecting_strategy="largest_cluster")
cluster_func.fit(X_tfidf)

However, as the n-gram size or the dataset size increases, it is easy to run into the error shown in the title.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/INS/ruanjunhao04/ruanjunhao/ndp/ndp/marisa_onlyclm.py", line 432, in <module>
    main()
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/INS/ruanjunhao04/ruanjunhao/ndp/ndp/marisa_onlyclm.py", line 404, in main
    clustered_data = ngram_split(train_dataset, max_dataset_size)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/INS/ruanjunhao04/ruanjunhao/ndp/ndp/marisa_onlyclm.py", line 340, in ngram_split
    kmeans.fit(X_tfidf)
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/ruanjunhao04/env/rjh/lib/python3.12/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/ruanjunhao04/env/rjh/lib/python3.12/site-packages/sklearn/cluster/_kmeans.py", line 2073, in fit
    X = self._validate_data(
        ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/ruanjunhao04/env/rjh/lib/python3.12/site-packages/sklearn/base.py", line 633, in _validate_data
    out = check_array(X, input_name="X", **check_params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/ruanjunhao04/env/rjh/lib/python3.12/site-packages/sklearn/utils/validation.py", line 971, in check_array
    array = _ensure_sparse_format(
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/ruanjunhao04/env/rjh/lib/python3.12/site-packages/sklearn/utils/validation.py", line 591, in _ensure_sparse_format
    _check_large_sparse(sparse_container, accept_large_sparse)
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/ruanjunhao04/env/rjh/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1145, in _check_large_sparse
    raise ValueError(
ValueError: Only sparse matrices with 32-bit integer indices are accepted. Got int64 indices. Please do report a minimal reproducer on scikit-learn issue tracker so that support for your use-case can be studied by maintainers. See: https://scikit-learn.org/dev/developers/minimal_reproducer.html
@rangehow rangehow added Needs Triage Issue requires triage New Feature labels Mar 26, 2025

ogrisel commented Mar 28, 2025

Could you please provide an actual reproducer that we can just copy and paste to reproduce?

I tried the following and cannot reproduce the error, because HashingVectorizer apparently does not use int64 indices by default.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import BisectingKMeans
import numpy as np

vectorizer = HashingVectorizer(ngram_range=(1, 5), alternate_sign=False, norm='l1')

X = np.asarray([
    'The quick brown fox jumps over the lazy dog',
    'The lazy brown fox jumps over the brown dog',
    'The lazy brown dog jumps over the quick fox',
], dtype=object)
X_vectorized = vectorizer.fit_transform(X)


bkm = BisectingKMeans(n_clusters=2, random_state=42, bisecting_strategy="largest_cluster")
bkm.fit(X_vectorized)

What is the shape of your vectorized data?

>>> X_vectorized.shape
(3, 1048576)
>>> X_vectorized.indices.dtype
dtype('int32')

@ogrisel ogrisel added Needs Reproducible Code Issue requires reproducible code and removed Needs Triage Issue requires triage labels Mar 28, 2025

ogrisel commented Mar 28, 2025

If the shape is small enough, maybe you could change parallel_transform to ensure that you are using int32 indices internally instead.
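For instance, a minimal sketch of what that could look like when stacking the per-chunk matrices (downcast_indices_to_int32 is a hypothetical helper, and this only works if the number of columns and the number of stored non-zeros actually fit in the int32 range):

import numpy as np
import scipy.sparse as sp

def downcast_indices_to_int32(X):
    # Hypothetical helper: return a CSR matrix with int32 indices when that is safe.
    X = X.tocsr()
    int32_max = np.iinfo(np.int32).max
    # Downcasting is only safe if both the column indices and the indptr values
    # (which go up to the number of stored non-zeros) fit in int32.
    if X.shape[1] <= int32_max and X.nnz <= int32_max:
        X.indices = X.indices.astype(np.int32)
        X.indptr = X.indptr.astype(np.int32)
    return X

# e.g. inside parallel_transform, after stacking the per-chunk results:
# X_tfidf = downcast_indices_to_int32(sp.vstack(chunk_matrices))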


ogrisel commented Mar 28, 2025

Note that I can reproduce the exception as expected if I force the use of int64 indices in the sparse matrix:

X_vectorized.indices = X_vectorized.indices.astype(np.int64)
X_vectorized.indptr = X_vectorized.indptr.astype(np.int64)
bkm.fit(X_vectorized)


rangehow commented Mar 28, 2025

But my data really is that large, so its indices can't be represented in the int32 range. Basically, I'm wondering whether there is some way to cluster a really huge dataset using its n-gram features.


ogrisel commented Mar 28, 2025

What is the exact shape? That is, the number of rows and dimensions of X_vectorized (X_vectorized.shape) and the number of non-zero entries (e.g. the shape of X_vectorized.indices)?


ogrisel commented Mar 28, 2025

Basically, I'm wondering whether there is some way to cluster a really huge dataset using its n-gram features.

You can probably cluster a dataset with a large number of rows progressively, using either MiniBatchKMeans or BIRCH via their partial_fit method.
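As an illustration, here is a minimal sketch of that idea with MiniBatchKMeans (the toy document_chunks list and n_clusters=2 are placeholders for the real chunked dataset and cluster count):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

vectorizer = HashingVectorizer(ngram_range=(1, 5), alternate_sign=False, norm='l1')
kmeans = MiniBatchKMeans(n_clusters=2, random_state=42)

# Toy stand-in for the real 1M-document dataset, split into chunks.
document_chunks = [
    ["the quick brown fox", "the lazy brown dog"],
    ["the lazy fox jumps", "the quick dog sleeps"],
]

# First pass: fit the clustering incrementally, one chunk at a time, so the
# full vectorized matrix never has to be materialized at once.
for chunk in document_chunks:
    kmeans.partial_fit(vectorizer.transform(chunk))

# Second pass: assign cluster labels chunk by chunk.
labels = []
for chunk in document_chunks:
    labels.extend(kmeans.predict(vectorizer.transform(chunk)))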


ogrisel commented Mar 28, 2025

You can also preprocess the data incrementally to reduce its dimensionality using IncrementalPCA, which also supports partial_fit, although it might not be very efficient on sparse data.
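A minimal sketch of that combination (each sparse batch is densified with toarray() before being passed to partial_fit, which is part of why this may not be efficient on sparse data; the small n_features, n_components, and toy batches are placeholders):

import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.feature_extraction.text import HashingVectorizer

# A smaller hash space keeps the densified batches manageable (placeholder value).
vectorizer = HashingVectorizer(n_features=2**12, ngram_range=(1, 5),
                               alternate_sign=False, norm='l1')
ipca = IncrementalPCA(n_components=2)

# Toy batches standing in for chunks of the real dataset.
batches = [
    ["the quick brown fox", "the lazy brown dog", "a quick lazy fox"],
    ["the lazy fox jumps", "the quick dog sleeps", "a brown dog jumps"],
]

# Learn the projection incrementally, densifying one batch at a time.
for batch in batches:
    ipca.partial_fit(vectorizer.transform(batch).toarray())

# Project the data into the reduced space before clustering.
X_reduced = np.vstack([ipca.transform(vectorizer.transform(b).toarray()) for b in batches])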
