ColumnTransformer requires at least one column for each part it transforms #12071

Closed
@janvanrijn

Description

ColumnTransformer requires at least one column for each part it transforms. This sounds logical, but it makes automated experimentation across datasets with mixed input types hard to do with a single sklearn model. I would need three separate models for:

  • datasets that have only numeric inputs
  • datasets that have only categorical inputs
  • datasets that have mixed (numeric / categorical) inputs

Of course, this is doable, but it would be extremely convenient to be able to do all this with one sklearn model.

Steps/Code to Reproduce

import sklearn
import sklearn.datasets
import sklearn.compose
import sklearn.tree
import sklearn.impute
import sklearn.pipeline
import sklearn.preprocessing

X, y = sklearn.datasets.fetch_openml('iris', 1, return_X_y=True)

numeric_transformer = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.Imputer(),
    sklearn.preprocessing.StandardScaler())

categorical_transformer = sklearn.pipeline.make_pipeline(
    sklearn.impute.SimpleImputer(strategy='constant', fill_value='missing'),
    sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore')
)

transformer = sklearn.compose.ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, []),  # empty column list: this triggers the error
        ('nominal', categorical_transformer, [0,1,2,3])],
    remainder='passthrough')

clf = sklearn.pipeline.make_pipeline(transformer, sklearn.tree.DecisionTreeClassifier())

clf.fit(X, y)

Expected Results

a fitted model :)

Actual Results

Traceback (most recent call last):
  File "/home/janvanrijn/projects/sklearn-bot/testjan.py", line 25, in <module>
    clf.fit(X, y)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 265, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 230, in _fit
    **fit_params_steps[name])
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 329, in __call__
    return self.func(*args, **kwargs)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 614, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/compose/_column_transformer.py", line 425, in fit_transform
    result = self._fit_transform(X, y, _fit_transform_one)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/compose/_column_transformer.py", line 371, in _fit_transform
    X=X, fitted=fitted, replace_strings=True))
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 614, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 298, in fit_transform
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 230, in _fit
    **fit_params_steps[name])
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 329, in __call__
    return self.func(*args, **kwargs)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 614, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/base.py", line 462, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/preprocessing/imputation.py", line 158, in fit
    force_all_finite=False)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/utils/validation.py", line 585, in check_array
    context))
ValueError: Found array with 0 feature(s) (shape=(150, 0)) while a minimum of 1 is required.

Versions

I just installed the git branch 0.20.X.

Proposed solution

I can author a PR that either checks the column count or passes through a constant dummy column.
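
As a user-side workaround in the meantime, one option might be to drop transformers with an empty column selection before constructing the ColumnTransformer. The following is only a minimal sketch: `drop_empty_transformers` is a hypothetical helper, not part of scikit-learn, and it assumes the columns are given as explicit lists of indices or names (slices, boolean masks, and callables would need their own emptiness check). Plain strings stand in for the pipelines from the reproduction code so the sketch runs without any dependencies.

```python
def drop_empty_transformers(transformers):
    """Return only the (name, transformer, columns) triples that select columns.

    Hypothetical helper, not part of scikit-learn. Assumes `columns` is an
    explicit list of column indices or names.
    """
    return [
        (name, transformer, columns)
        for name, transformer, columns in transformers
        if len(columns) > 0
    ]


# Strings stand in for the numeric and categorical pipelines from the report.
specs = [
    ("numeric", "numeric_transformer", []),
    ("nominal", "categorical_transformer", [0, 1, 2, 3]),
]
kept = drop_empty_transformers(specs)
print([name for name, _, _ in kept])  # prints ['nominal']
```

The filtered list could then be passed as the `transformers` argument of `sklearn.compose.ColumnTransformer`, so the same model-building code would cover numeric-only, categorical-only, and mixed datasets.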
