ValueError: Only sparse matrices with 32-bit integer indices are accepted. #31073


Open

rangehow opened this issue Mar 26, 2025 · 7 comments

Labels: Needs Reproducible Code, New Feature

Comments

rangehow commented Mar 26, 2025

Describe the workflow you want to enable

The use case that triggers the issue is very simple. I am trying to compute n-gram features for a tokenized dataset of about 1M documents (i.e., each List[str] converted to List[int]) and then cluster the dataset based on these features.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import BisectingKMeans

vectorizer = HashingVectorizer(ngram_range=(1, 5), alternate_sign=False, norm='l1')

# parallel_transform is a user-defined helper that splits the dataset into
# chunks, transforms each chunk with the vectorizer in a separate process,
# and stacks the resulting sparse matrices.
X_tfidf = parallel_transform(train_dataset, vectorizer, num_chunks=64)

cluster_func = BisectingKMeans(n_clusters=num_clusters, random_state=42, bisecting_strategy="largest_cluster")
cluster_func.fit(X_tfidf)

However, as the n-gram size or the dataset size increases, it is easy to run into the error shown in the title.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/INS/ruanjunhao04/ruanjunhao/ndp/ndp/marisa_onlyclm.py", line 432, in <module>
    main()
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/INS/ruanjunhao04/ruanjunhao/ndp/ndp/marisa_onlyclm.py", line 404, in main
    clustered_data = ngram_split(train_dataset, max_dataset_size)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/INS/ruanjunhao04/ruanjunhao/ndp/ndp/marisa_onlyclm.py", line 340, in ngram_split
    kmeans.fit(X_tfidf)
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/ruanjunhao04/env/rjh/lib/python3.12/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/ruanjunhao04/env/rjh/lib/python3.12/site-packages/sklearn/cluster/_kmeans.py", line 2073, in fit
    X = self._validate_data(
        ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/ruanjunhao04/env/rjh/lib/python3.12/site-packages/sklearn/base.py", line 633, in _validate_data
    out = check_array(X, input_name="X", **check_params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/ruanjunhao04/env/rjh/lib/python3.12/site-packages/sklearn/utils/validation.py", line 971, in check_array
    array = _ensure_sparse_format(
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/ruanjunhao04/env/rjh/lib/python3.12/site-packages/sklearn/utils/validation.py", line 591, in _ensure_sparse_format
    _check_large_sparse(sparse_container, accept_large_sparse)
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-aipnlp/ruanjunhao04/env/rjh/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1145, in _check_large_sparse
    raise ValueError(
ValueError: Only sparse matrices with 32-bit integer indices are accepted. Got int64 indices. Please do report a minimal reproducer on scikit-learn issue tracker so that support for your use-case can be studied by maintainers. See: https://scikit-learn.org/dev/developers/minimal_reproducer.html
@rangehow rangehow added Needs Triage Issue requires triage New Feature labels Mar 26, 2025

ogrisel commented Mar 28, 2025

Could you please provide an actual reproducer that we can just copy and paste to reproduce?

I tried the following and cannot reproduce the error, because HashingVectorizer apparently does not use int64 indices by default.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import BisectingKMeans
import numpy as np

vectorizer = HashingVectorizer(ngram_range=(1, 5), alternate_sign=False, norm='l1')

X = np.asarray([
    'The quick brown fox jumps over the lazy dog',
    'The lazy brown fox jumps over the brown dog',
    'The lazy brown dog jumps over the quick fox',
], dtype=object)
X_vectorized = vectorizer.fit_transform(X)


bkm = BisectingKMeans(n_clusters=2, random_state=42, bisecting_strategy="largest_cluster")
bkm.fit(X_vectorized)

What is the shape of your vectorized data?

>>> X_vectorized.shape
(3, 1048576)
>>> X_vectorized.indices.dtype
dtype('int32')

@ogrisel ogrisel added Needs Reproducible Code Issue requires reproducible code and removed Needs Triage Issue requires triage labels Mar 28, 2025

ogrisel commented Mar 28, 2025

If the shape is small enough, maybe you could change parallel_transform to ensure that you are using int32 indices internally instead.
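For instance, a minimal sketch of what that could look like when stacking the per-chunk matrices (downcast_indices_to_int32 is a hypothetical helper, and this only works if the number of columns and the number of stored non-zeros actually fit in the int32 range):

import numpy as np
import scipy.sparse as sp

def downcast_indices_to_int32(X):
    # Hypothetical helper: return a CSR matrix with int32 indices when that is safe.
    X = X.tocsr()
    int32_max = np.iinfo(np.int32).max
    # Downcasting is only safe if both the column indices and the indptr values
    # (which go up to the number of stored non-zeros) fit in int32.
    if X.shape[1] <= int32_max and X.nnz <= int32_max:
        X.indices = X.indices.astype(np.int32)
        X.indptr = X.indptr.astype(np.int32)
    return X

# e.g. inside parallel_transform, after stacking the per-chunk results:
# X_tfidf = downcast_indices_to_int32(sp.vstack(chunk_matrices))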


ogrisel commented Mar 28, 2025

Note that I can reproduce the exception as expected if I force the use of int64 indices in the sparse matrix:

X_vectorized.indices = X_vectorized.indices.astype(np.int64)
X_vectorized.indptr = X_vectorized.indptr.astype(np.int64)
bkm.fit(X_vectorized)


rangehow commented Mar 28, 2025

But my data really is that large, so its indices can't be represented in the int32 range. Basically, I'm wondering whether there is some way to cluster a really huge dataset using its n-gram features.


ogrisel commented Mar 28, 2025

What is the exact shape? That is, the number of rows and dimensions of X_vectorized (X_vectorized.shape) and the number of non-zero entries (e.g. the shape of X_vectorized.indices)?


ogrisel commented Mar 28, 2025

Basically, I'm wondering whether there is some way to cluster a really huge dataset using its n-gram features.

You can probably cluster a dataset with a large number of rows progressively, using either MiniBatchKMeans or BIRCH via their partial_fit method.
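As an illustration, here is a minimal sketch of that idea with MiniBatchKMeans (the toy document_chunks list and n_clusters=2 are placeholders for the real chunked dataset and cluster count):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

vectorizer = HashingVectorizer(ngram_range=(1, 5), alternate_sign=False, norm='l1')
kmeans = MiniBatchKMeans(n_clusters=2, random_state=42)

# Toy stand-in for the real 1M-document dataset, split into chunks.
document_chunks = [
    ["the quick brown fox", "the lazy brown dog"],
    ["the lazy fox jumps", "the quick dog sleeps"],
]

# First pass: fit the clustering incrementally, one chunk at a time, so the
# full vectorized matrix never has to be materialized at once.
for chunk in document_chunks:
    kmeans.partial_fit(vectorizer.transform(chunk))

# Second pass: assign cluster labels chunk by chunk.
labels = []
for chunk in document_chunks:
    labels.extend(kmeans.predict(vectorizer.transform(chunk)))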


ogrisel commented Mar 28, 2025

You can also preprocess the data incrementally to reduce its dimensionality using IncrementalPCA, which also supports partial_fit, although it might not be very efficient on sparse data.
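A minimal sketch of that combination (each sparse batch is densified with toarray() before being passed to partial_fit, which is part of why this may not be efficient on sparse data; the small n_features, n_components, and toy batches are placeholders):

import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.feature_extraction.text import HashingVectorizer

# A smaller hash space keeps the densified batches manageable (placeholder value).
vectorizer = HashingVectorizer(n_features=2**12, ngram_range=(1, 5),
                               alternate_sign=False, norm='l1')
ipca = IncrementalPCA(n_components=2)

# Toy batches standing in for chunks of the real dataset.
batches = [
    ["the quick brown fox", "the lazy brown dog", "a quick lazy fox"],
    ["the lazy fox jumps", "the quick dog sleeps", "a brown dog jumps"],
]

# Learn the projection incrementally, densifying one batch at a time.
for batch in batches:
    ipca.partial_fit(vectorizer.transform(batch).toarray())

# Project the data into the reduced space before clustering.
X_reduced = np.vstack([ipca.transform(vectorizer.transform(b).toarray()) for b in batches])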
