Using ColumnTransformer when not all specified columns present in data #15781

@larskouwenhoven

Description

ColumnTransformer requires specifying the columns to which each transformer is applied. However, when the input dataset is missing one of the specified columns, an error is raised. I would like to be able to use the ColumnTransformer even when some of the specified columns are not present in the data. Is this possible?

Steps/Code to Reproduce

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.pipeline import Pipeline
import sklearn.datasets
import pandas as pd

X, y = sklearn.datasets.fetch_openml('iris', 1, return_X_y=True)
X = pd.DataFrame(X, columns=["one", "two", "three", "four"])

numeric_features = ["one", "two", "three", "four", "five"]

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value=0)),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
    ],
    remainder="drop",
    sparse_threshold=0,
)

preprocessor.fit_transform(X)

Expected Results

A transformed DataFrame/array, with the missing column ignored

Actual Results

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 474, in fit_transform
    self._validate_remainder(X)
  File "/usr/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 315, in _validate_remainder
    cols.extend(_get_column_indices(X, columns))
  File "/usr/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 703, in _get_column_indices
    return [all_columns.index(col) for col in columns]
  File "/usr/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 703, in <listcomp>
    return [all_columns.index(col) for col in columns]
ValueError: 'five' is not in list
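
The traceback shows the lookup failing at `all_columns.index(col)`, i.e. every listed column name is expected to exist in the input. One possible workaround, sketched here under assumptions rather than offered as an official answer, is to pass a callable as the column specification so that only the columns actually present in the data are selected. `present_columns` is a hypothetical helper introduced for illustration, and the small DataFrame stands in for the iris data above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer


def present_columns(wanted):
    """Hypothetical helper: build a selector that skips columns missing from X."""
    def selector(X):
        # ColumnTransformer calls this with the input data when fitting.
        return [col for col in wanted if col in X.columns]
    return selector


# Stand-in data: "five" is requested below but does not exist here.
X = pd.DataFrame({"one": [1.0, None], "two": [3.0, 4.0]})

preprocessor = ColumnTransformer(
    transformers=[
        (
            "num",
            SimpleImputer(strategy="constant", fill_value=0),
            present_columns(["one", "two", "five"]),
        ),
    ],
    remainder="drop",
    sparse_threshold=0,
)

result = preprocessor.fit_transform(X)
print(result)
```

Note that the callable is evaluated when the transformer is fitted, so this sketch only relaxes the requirement at fit time; data passed to `transform` later would still need the columns that were selected during fitting.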

Versions

System:
    python: 3.8.0 (default, Oct 23 2019, 18:51:26)  [GCC 9.2.0]
executable: /bin/python
   machine: Linux-4.9.130-xxxx-std-ipv6-64-x86_64-with-glibc2.2.5

Python deps:
       pip: 19.2.3
setuptools: 41.6.0
   sklearn: 0.21.3
     numpy: 1.17.4
     scipy: 1.3.1
    Cython: 0.29.14
    pandas: 0.25.3
