ColumnTransformer does not validate sparse formats for X #30275

Open
@antoinebaker

Describe the bug

If all the underlying transformers accept sparse input data, ColumnTransformer should accept sparse input data as well. This is indeed the case for the csr, csc, lil and dok formats, but the bsr, coo and dia formats raise errors because they are not "subscriptable".
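To see which formats are affected on a given installation, the following sketch probes each sparse format for column indexing. The helper name `supports_column_indexing` is hypothetical, and the result depends on the installed SciPy version, so no particular output is claimed here.

```python
import numpy as np
import scipy.sparse as sp

X_dense = np.random.RandomState(1).uniform(size=(10, 3))


def supports_column_indexing(fmt):
    """Return True if a sparse array in format `fmt` accepts X[:, [0, 1]].

    Hypothetical helper, for illustration only.
    """
    X = sp.csr_array(X_dense).asformat(fmt)
    try:
        X[:, [0, 1]]
    except (TypeError, NotImplementedError, IndexError):
        return False
    return True


formats = ("csr", "csc", "lil", "dok", "bsr", "coo", "dia")
results = {fmt: supports_column_indexing(fmt) for fmt in formats}
for fmt, ok in results.items():
    print(f"{fmt}: {'subscriptable' if ok else 'not subscriptable'}")
```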

As a possible fix, we could validate sparse input data using accept_sparse=("csr", "csc", "lil", "dok"), which would convert any other format to a "subscriptable" sparse format. Currently this is not done because ColumnTransformer relies on its own _check_X, which often bypasses the validation entirely, maybe for performance reasons?
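A minimal sketch of what this validation would do, applied by hand on the user side with the public `sklearn.utils.check_array` helper. This only illustrates the conversion and is not the proposed change to ColumnTransformer itself; it assumes that `check_array` converts an input whose format is not listed to the first format in `accept_sparse`.

```python
import numpy as np
from scipy.sparse import dia_array
from sklearn.compose import ColumnTransformer
from sklearn.utils import check_array

rng = np.random.RandomState(1)
X = dia_array(rng.uniform(size=(10, 3)))
y = rng.randint(0, 3, size=10)

# "dia" is not in the accepted formats, so the input is converted
# (assumption: to the first listed format, "csr").
X_checked = check_array(X, accept_sparse=("csr", "csc", "lil", "dok"))
print(X.format, "->", X_checked.format)

# The converted input supports column indexing, so fitting works.
est = ColumnTransformer(transformers=[("trans1", "passthrough", [0, 1])])
Xt = est.fit_transform(X_checked, y)
print(Xt.shape)
```

Until the validation is handled inside ColumnTransformer, converting explicitly (for example with `X.tocsr()`) before calling `fit` is an equivalent workaround.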

Steps/Code to Reproduce

import numpy as np
from scipy.sparse import dia_array
from sklearn.compose import ColumnTransformer

rng = np.random.RandomState(1)
X = rng.uniform(size=(10, 3))
y = rng.randint(0, 3, size=10)
X = dia_array(X)

est = ColumnTransformer(transformers=[("trans1", "passthrough", [0, 1])])
est.fit(X, y)

Expected Results

No error is thrown.

Actual Results

TypeError: 'dia_array' object is not subscriptable

Versions

System:
    python: 3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:32:50) [Clang 16.0.6 ]
executable: /Users/abaker/miniforge3/envs/sklearn-dev/bin/python
   machine: macOS-14.5-arm64-arm-64bit

Python dependencies:
      sklearn: 1.6.dev0
          pip: 24.2
   setuptools: 73.0.1
        numpy: 2.1.0
        scipy: 1.14.1
       Cython: 3.0.11
       pandas: 2.2.2
   matplotlib: 3.9.2
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/abaker/miniforge3/envs/sklearn-dev/lib/libopenblas.0.dylib
        version: 0.3.27
threading_layer: openmp
   architecture: VORTEX

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/abaker/miniforge3/envs/sklearn-dev/lib/libomp.dylib
        version: None
