Describe the bug
If the underlying transformers all accept sparse input data, ColumnTransformer should also be able to accept sparse input data. That's indeed the case for the csr, csc, lil and dok formats, but it raises errors for the bsr, coo and dia formats because those are not "subscriptable".
As a possible fix, we could validate sparse input data using accept_sparse=("csr", "csc", "lil", "dok"), which would then convert the input to a "subscriptable" sparse format. Currently this is not done because ColumnTransformer relies on its own _check_X, which often bypasses the validation entirely, maybe for performance reasons?
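To illustrate the conversion this proposal relies on, here is a minimal sketch that calls the public sklearn.utils.check_array helper directly. It is not how ColumnTransformer validates its input today, and the variable names are only illustrative. It assumes a SciPy version that provides dia_array.

```python
import numpy as np
from scipy.sparse import dia_array
from sklearn.utils import check_array

rng = np.random.RandomState(1)
X_dia = dia_array(rng.uniform(size=(10, 3)))

# "dia" is not in the accepted formats, so check_array converts the input
# to the first listed format ("csr"), which supports indexing.
X_checked = check_array(X_dia, accept_sparse=("csr", "csc", "lil", "dok"))
print(X_checked.format)

# Column selection now works on the converted container.
X_subset = X_checked[:, [0, 1]]
print(X_subset.shape)
```

Whether an extra conversion is acceptable inside ColumnTransformer is the open question raised above, since it adds a copy for inputs that are not already in an accepted format.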
Steps/Code to Reproduce
import numpy as np
from scipy.sparse import dia_array
from sklearn.compose import ColumnTransformer
rng = np.random.RandomState(1)
X = rng.uniform(size=(10, 3))
y = rng.randint(0, 3, size=10)
X = dia_array(X)
est = ColumnTransformer(transformers=[('trans1','passthrough',[0,1])])
est.fit(X, y)
Expected Results
No error is thrown.
Actual Results
TypeError: 'dia_array' object is not subscriptable
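As a possible user-side workaround until the validation changes, the input can be converted to one of the formats reported as working above before it is passed to the estimator. The sketch below reuses the reproducer and converts to CSR with tocsr(). It is an illustration under that assumption, not a proposed fix.

```python
import numpy as np
from scipy.sparse import dia_array
from sklearn.compose import ColumnTransformer

rng = np.random.RandomState(1)
X = dia_array(rng.uniform(size=(10, 3)))
y = rng.randint(0, 3, size=10)

est = ColumnTransformer(transformers=[("trans1", "passthrough", [0, 1])])

# Convert the DIA container to CSR, which supports column indexing,
# before handing it to the ColumnTransformer.
Xt = est.fit_transform(X.tocsr(), y)
print(Xt.shape)
```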
Versions
System:
    python: 3.12.5 | packaged by conda-forge | (main, Aug 8 2024, 18:32:50) [Clang 16.0.6 ]
    executable: /Users/abaker/miniforge3/envs/sklearn-dev/bin/python
    machine: macOS-14.5-arm64-arm-64bit

Python dependencies:
    sklearn: 1.6.dev0
    pip: 24.2
    setuptools: 73.0.1
    numpy: 2.1.0
    scipy: 1.14.1
    Cython: 3.0.11
    pandas: 2.2.2
    matplotlib: 3.9.2
    joblib: 1.4.2
    threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
    user_api: blas
    internal_api: openblas
    num_threads: 8
    prefix: libopenblas
    filepath: /Users/abaker/miniforge3/envs/sklearn-dev/lib/libopenblas.0.dylib
    version: 0.3.27
    threading_layer: openmp
    architecture: VORTEX

    user_api: openmp
    internal_api: openmp
    num_threads: 8
    prefix: libomp
    filepath: /Users/abaker/miniforge3/envs/sklearn-dev/lib/libomp.dylib
    version: None